Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
On Sat, Aug 10, 2013 at 05:36:07PM +0200, Marek Olšák wrote: On Sat, Aug 10, 2013 at 5:09 PM, Christian König deathsim...@vodafone.de wrote: Am 10.08.2013 15:53, schrieb Marek Olšák: The RCU approach sounds good, but you can never know if 16 is enough. We should release the buffer once it is full and allocate a new one. The cache bufmgr in the winsys will assure there won't be any buffer allocation overhead - it would work kinda a like a ring of buffers. Are you sure of that? The overhead of allocating a new buffer was what always looked so unfriendly to me with this approach. Absolutely. We've had this optimization since 2010 or so. As long as you use winsys-buffer_create(use_reusable_pool=TRUE), you're fine. A deleted buffer is added to a list of deleted buffers and if nobody reclaims it, it will be released after 1 second. During that 1 second, the buffer must first become idle and then anybody can reclaim it with the same buffer_create call. If the buffer had been mapped before, it's still mapped, so there is even no map-buffer overhead. This is a crucial part of our driver stack that makes vertex uploading so efficient. Don't underestimate the Radeon winsys. :) Just an fyi how we do that for libdrm_intel: We have an madv ioctl, and every buffer in the allocation cache is marked purgeable. When memory gets tight the kernel can drop the backing storage on the floor right away for such buffers (and we preferrentially reap such objects from our shrinker). When reusing such a buffer libdrm calls madv(WILLNEED) to make sure the buffer is still around. If it got reaped libdrm cleans up the entire cache and checks for other reaped buffer objects. With this you don't need tunables like the 1s delay you're using and instead cache size is directly driven by vm pressure. Cheers, Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
On Sat, Aug 10, 2013 at 11:09 AM, Christian König deathsim...@vodafone.de wrote: Am 10.08.2013 15:53, schrieb Marek Olšák: The RCU approach sounds good, but you can never know if 16 is enough. We should release the buffer once it is full and allocate a new one. The cache bufmgr in the winsys will assure there won't be any buffer allocation overhead - it would work kinda a like a ring of buffers. Are you sure of that? The overhead of allocating a new buffer was what always looked so unfriendly to me with this approach. On the other hand the CP definitely can't handle more than 8 contexts at the same time (and one of them is always the clear context), so I strongly think we should be on the save side with 16 slots here. I'm just not sure if the SQ could add some more depth to our pipeline, maybe Alex knows more on this. IIRC, it's 8 contexts on larger cards and 4 on smaller cards. You can look up the details in the kernel driver. The contexts only affect the CONTEXT registers from the 3D pipeline. Alex Christian. Marek On Sat, Aug 10, 2013 at 10:45 AM, Christian König deathsim...@vodafone.de wrote: Am 09.08.2013 20:06, schrieb Marek Olšák: [SNIP] What if I kept the current emission code, and only allocated a new buffer at the end of the emit function, copied all descriptors to it using CP_DMA or COPY_DATA, and pointed SPI_SHADER_USER_DATA to it. The buffer where the descriptors are updated using WRITE_DATA would act as a staging buffer only and shaders would always read from the fresh new copy. Does that sound good to you? That sounds like the solution with multiple buffers I already suggest, but I would rather use some RCU approach to it. Basically we just have to handle it like the context based resources on earlier asics. So to me a proper solution should look something like this: We allocate a ring of (let's say) 16 slots for descriptor arrays, fill the first slot with WRITE_DATA packets and then use it in a draw command. As soon as any of the descriptors is about to change we copy it's content to the next slot, let the SPI_SHADER_USER_DATA point to it, make the necessary updates using WRITE_DATA and then use it in a draw command. This repeats over and over again, all we need to make sure is that we have enough slots in the ring to be sure that we never override descriptors when they are still in use, but I'm pretty sure that we should be on the save side with 16 or so. We can even prepare the commands for the switch from one slot to the next only once and then use it for the whole lifetime of the driver. Christian. ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
Am 09.08.2013 20:06, schrieb Marek Olšák: [SNIP] What if I kept the current emission code, and only allocated a new buffer at the end of the emit function, copied all descriptors to it using CP_DMA or COPY_DATA, and pointed SPI_SHADER_USER_DATA to it. The buffer where the descriptors are updated using WRITE_DATA would act as a staging buffer only and shaders would always read from the fresh new copy. Does that sound good to you? That sounds like the solution with multiple buffers I already suggest, but I would rather use some RCU approach to it. Basically we just have to handle it like the context based resources on earlier asics. So to me a proper solution should look something like this: We allocate a ring of (let's say) 16 slots for descriptor arrays, fill the first slot with WRITE_DATA packets and then use it in a draw command. As soon as any of the descriptors is about to change we copy it's content to the next slot, let the SPI_SHADER_USER_DATA point to it, make the necessary updates using WRITE_DATA and then use it in a draw command. This repeats over and over again, all we need to make sure is that we have enough slots in the ring to be sure that we never override descriptors when they are still in use, but I'm pretty sure that we should be on the save side with 16 or so. We can even prepare the commands for the switch from one slot to the next only once and then use it for the whole lifetime of the driver. Christian. ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
The RCU approach sounds good, but you can never know if 16 is enough. We should release the buffer once it is full and allocate a new one. The cache bufmgr in the winsys will assure there won't be any buffer allocation overhead - it would work kinda a like a ring of buffers. Marek On Sat, Aug 10, 2013 at 10:45 AM, Christian König deathsim...@vodafone.de wrote: Am 09.08.2013 20:06, schrieb Marek Olšák: [SNIP] What if I kept the current emission code, and only allocated a new buffer at the end of the emit function, copied all descriptors to it using CP_DMA or COPY_DATA, and pointed SPI_SHADER_USER_DATA to it. The buffer where the descriptors are updated using WRITE_DATA would act as a staging buffer only and shaders would always read from the fresh new copy. Does that sound good to you? That sounds like the solution with multiple buffers I already suggest, but I would rather use some RCU approach to it. Basically we just have to handle it like the context based resources on earlier asics. So to me a proper solution should look something like this: We allocate a ring of (let's say) 16 slots for descriptor arrays, fill the first slot with WRITE_DATA packets and then use it in a draw command. As soon as any of the descriptors is about to change we copy it's content to the next slot, let the SPI_SHADER_USER_DATA point to it, make the necessary updates using WRITE_DATA and then use it in a draw command. This repeats over and over again, all we need to make sure is that we have enough slots in the ring to be sure that we never override descriptors when they are still in use, but I'm pretty sure that we should be on the save side with 16 or so. We can even prepare the commands for the switch from one slot to the next only once and then use it for the whole lifetime of the driver. Christian. ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
Am 10.08.2013 15:53, schrieb Marek Olšák: The RCU approach sounds good, but you can never know if 16 is enough. We should release the buffer once it is full and allocate a new one. The cache bufmgr in the winsys will assure there won't be any buffer allocation overhead - it would work kinda a like a ring of buffers. Are you sure of that? The overhead of allocating a new buffer was what always looked so unfriendly to me with this approach. On the other hand the CP definitely can't handle more than 8 contexts at the same time (and one of them is always the clear context), so I strongly think we should be on the save side with 16 slots here. I'm just not sure if the SQ could add some more depth to our pipeline, maybe Alex knows more on this. Christian. Marek On Sat, Aug 10, 2013 at 10:45 AM, Christian König deathsim...@vodafone.de wrote: Am 09.08.2013 20:06, schrieb Marek Olšák: [SNIP] What if I kept the current emission code, and only allocated a new buffer at the end of the emit function, copied all descriptors to it using CP_DMA or COPY_DATA, and pointed SPI_SHADER_USER_DATA to it. The buffer where the descriptors are updated using WRITE_DATA would act as a staging buffer only and shaders would always read from the fresh new copy. Does that sound good to you? That sounds like the solution with multiple buffers I already suggest, but I would rather use some RCU approach to it. Basically we just have to handle it like the context based resources on earlier asics. So to me a proper solution should look something like this: We allocate a ring of (let's say) 16 slots for descriptor arrays, fill the first slot with WRITE_DATA packets and then use it in a draw command. As soon as any of the descriptors is about to change we copy it's content to the next slot, let the SPI_SHADER_USER_DATA point to it, make the necessary updates using WRITE_DATA and then use it in a draw command. This repeats over and over again, all we need to make sure is that we have enough slots in the ring to be sure that we never override descriptors when they are still in use, but I'm pretty sure that we should be on the save side with 16 or so. We can even prepare the commands for the switch from one slot to the next only once and then use it for the whole lifetime of the driver. Christian. ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
On Sat, Aug 10, 2013 at 5:09 PM, Christian König deathsim...@vodafone.de wrote: Am 10.08.2013 15:53, schrieb Marek Olšák: The RCU approach sounds good, but you can never know if 16 is enough. We should release the buffer once it is full and allocate a new one. The cache bufmgr in the winsys will assure there won't be any buffer allocation overhead - it would work kinda a like a ring of buffers. Are you sure of that? The overhead of allocating a new buffer was what always looked so unfriendly to me with this approach. Absolutely. We've had this optimization since 2010 or so. As long as you use winsys-buffer_create(use_reusable_pool=TRUE), you're fine. A deleted buffer is added to a list of deleted buffers and if nobody reclaims it, it will be released after 1 second. During that 1 second, the buffer must first become idle and then anybody can reclaim it with the same buffer_create call. If the buffer had been mapped before, it's still mapped, so there is even no map-buffer overhead. This is a crucial part of our driver stack that makes vertex uploading so efficient. Don't underestimate the Radeon winsys. :) On the other hand the CP definitely can't handle more than 8 contexts at the same time (and one of them is always the clear context), so I strongly think we should be on the save side with 16 slots here. I'm just not sure if the SQ could add some more depth to our pipeline, maybe Alex knows more on this. If we needed only 16 slots, that would be even better. Marek ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
Am 08.08.2013 21:38, schrieb Alex Deucher: On Thu, Aug 8, 2013 at 1:34 PM, Marek Olšák mar...@gmail.com wrote: On Thu, Aug 8, 2013 at 6:57 PM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 16:33, schrieb Marek Olšák: On Thu, Aug 8, 2013 at 3:09 PM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 14:38, schrieb Marek Olšák: .On Thu, Aug 8, 2013 at 9:47 AM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 02:20, schrieb Marek Olšák: FMASK is bound as a separate texture. For every texture, there can be an FMASK. Therefore a separate array of resource slots has to be added. This adds a new mechanism for emitting resource descriptors, its features are: - resource descriptors are stored in an ordinary buffer (not in a CS) Having resource descriptors outside of the CS has two problems that we need to solve first: 1. Fine grained descriptor updates doesn't work, I already tried that. The problem is that unlike previous asics descriptors are now a memory block, so no longer part of the CP context. So when we (for example) have a draw command executing and the next draw command is using new resources for a specific slot we would either block until the first draw command is finished (which is bad for performance) or change the descriptors while they are still in use (which results in VM faults). So what would the proper solution be here? Do I need to flush some caches or would moving the descriptor updates to the constant IB fix that? Actually the current implementation worked better than anything else I tried. When you really need the resource descriptors in a separate buffer you need to use one buffer for each draw call and always write the full buffer contents (no partial updates). Flushing anything won't really help either.. The only solution I see using one buffer is to block until the last draw call is finished with WAIT_REG_MEM, but that would be quite disastrous for performance. 2. If my understand is correct when they are embedded the descriptors are preloaded into the caches while executing the IB, so to archive the same speed with descriptors outside of the IB you need to add additional commands to the constant IB which is new to SI and we currently doesn't support in the CS interface. There seems to be support for the constant IB. The CS ioctl chunk ID is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in si_vm_packet3_ce_check. Is there anything missing? The userspace side seems to be missing and except for throwing NOP packets into it we never tested it. I know from the closed source side that it actually was quite tricky for them to get working. Additional to that please note that I'm not 100% sure that just putting the descriptors into the IB is really helping here. It was just the most simplest solution to avoid allocating a new buffer on each draw call. I understand. I don't really need to have resource descriptors in a separate buffer, all I need is these 3 basic features a gallium driver should support: - fine-grained resource updates (mainly for performance, see below) - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0) - no GPU crash if a shader is using SAMPLER[15] but there are no samplers bound FYI, partial sampler view and sampler state updates are coming to gallium, Brian Paul already has some patches, it's just a matter of time now. Vertex and constant buffer states already support partial updates. That shouldn't be to much off a problem. Just allocate a state at startup and initialize it with the proper pm4 commands for 16 samplers, then update the resource descriptors in that state when we change the bound textures/samplers/views/constants/whatever. All we need to do then is setting the emitted state to NULL so that it gets re-emitted in the next draw command. That would re-emit all 16 shader resources even if just one of them needs to be changed. I was trying to avoid this inefficiency. Is it really impossible to emit just one resource descriptor and keep the others unchanged? This is a basic D3D10/11 feature, for example: void ID3D11DeviceContext::VSSetShaderResources( [in] UINT StartSlot, [in] UINT NumViews, [in] ID3D11ShaderResourceView *const *ppShaderResourceViews ); If the constant engine is required to implement this interface efficiently, then I'd like to work on constant IB support. You'll need to either store them in memory or re-emit them if you store them in the IB. The CE is mainly there so that it can prime the TC in parallel with the command stream processing. Yeah indeed. The CE is just for prefetching everything into caches and doesn't really help here. The only two options I see is either fully emitting it into the command stream whenever anything changes or allocating a new buffer for the resources on each new draw call, copying over the old state and then setting just the things that changed. Both options have their pro and cons, no idea what
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
On Fri, Aug 9, 2013 at 10:34 AM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 21:38, schrieb Alex Deucher: On Thu, Aug 8, 2013 at 1:34 PM, Marek Olšák mar...@gmail.com wrote: On Thu, Aug 8, 2013 at 6:57 PM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 16:33, schrieb Marek Olšák: On Thu, Aug 8, 2013 at 3:09 PM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 14:38, schrieb Marek Olšák: .On Thu, Aug 8, 2013 at 9:47 AM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 02:20, schrieb Marek Olšák: FMASK is bound as a separate texture. For every texture, there can be an FMASK. Therefore a separate array of resource slots has to be added. This adds a new mechanism for emitting resource descriptors, its features are: - resource descriptors are stored in an ordinary buffer (not in a CS) Having resource descriptors outside of the CS has two problems that we need to solve first: 1. Fine grained descriptor updates doesn't work, I already tried that. The problem is that unlike previous asics descriptors are now a memory block, so no longer part of the CP context. So when we (for example) have a draw command executing and the next draw command is using new resources for a specific slot we would either block until the first draw command is finished (which is bad for performance) or change the descriptors while they are still in use (which results in VM faults). So what would the proper solution be here? Do I need to flush some caches or would moving the descriptor updates to the constant IB fix that? Actually the current implementation worked better than anything else I tried. When you really need the resource descriptors in a separate buffer you need to use one buffer for each draw call and always write the full buffer contents (no partial updates). Flushing anything won't really help either.. The only solution I see using one buffer is to block until the last draw call is finished with WAIT_REG_MEM, but that would be quite disastrous for performance. 2. If my understand is correct when they are embedded the descriptors are preloaded into the caches while executing the IB, so to archive the same speed with descriptors outside of the IB you need to add additional commands to the constant IB which is new to SI and we currently doesn't support in the CS interface. There seems to be support for the constant IB. The CS ioctl chunk ID is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in si_vm_packet3_ce_check. Is there anything missing? The userspace side seems to be missing and except for throwing NOP packets into it we never tested it. I know from the closed source side that it actually was quite tricky for them to get working. Additional to that please note that I'm not 100% sure that just putting the descriptors into the IB is really helping here. It was just the most simplest solution to avoid allocating a new buffer on each draw call. I understand. I don't really need to have resource descriptors in a separate buffer, all I need is these 3 basic features a gallium driver should support: - fine-grained resource updates (mainly for performance, see below) - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0) - no GPU crash if a shader is using SAMPLER[15] but there are no samplers bound FYI, partial sampler view and sampler state updates are coming to gallium, Brian Paul already has some patches, it's just a matter of time now. Vertex and constant buffer states already support partial updates. That shouldn't be to much off a problem. Just allocate a state at startup and initialize it with the proper pm4 commands for 16 samplers, then update the resource descriptors in that state when we change the bound textures/samplers/views/constants/whatever. All we need to do then is setting the emitted state to NULL so that it gets re-emitted in the next draw command. That would re-emit all 16 shader resources even if just one of them needs to be changed. I was trying to avoid this inefficiency. Is it really impossible to emit just one resource descriptor and keep the others unchanged? This is a basic D3D10/11 feature, for example: void ID3D11DeviceContext::VSSetShaderResources( [in] UINT StartSlot, [in] UINT NumViews, [in] ID3D11ShaderResourceView *const *ppShaderResourceViews ); If the constant engine is required to implement this interface efficiently, then I'd like to work on constant IB support. You'll need to either store them in memory or re-emit them if you store them in the IB. The CE is mainly there so that it can prime the TC in parallel with the command stream processing. Yeah indeed. The CE is just for prefetching everything into caches and doesn't really help here. The only two options I see is either fully emitting it into the command stream whenever anything changes or
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
Am 09.08.2013 15:29, schrieb Marek Olšák: On Fri, Aug 9, 2013 at 10:34 AM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 21:38, schrieb Alex Deucher: On Thu, Aug 8, 2013 at 1:34 PM, Marek Olšák mar...@gmail.com wrote: On Thu, Aug 8, 2013 at 6:57 PM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 16:33, schrieb Marek Olšák: On Thu, Aug 8, 2013 at 3:09 PM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 14:38, schrieb Marek Olšák: .On Thu, Aug 8, 2013 at 9:47 AM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 02:20, schrieb Marek Olšák: FMASK is bound as a separate texture. For every texture, there can be an FMASK. Therefore a separate array of resource slots has to be added. This adds a new mechanism for emitting resource descriptors, its features are: - resource descriptors are stored in an ordinary buffer (not in a CS) Having resource descriptors outside of the CS has two problems that we need to solve first: 1. Fine grained descriptor updates doesn't work, I already tried that. The problem is that unlike previous asics descriptors are now a memory block, so no longer part of the CP context. So when we (for example) have a draw command executing and the next draw command is using new resources for a specific slot we would either block until the first draw command is finished (which is bad for performance) or change the descriptors while they are still in use (which results in VM faults). So what would the proper solution be here? Do I need to flush some caches or would moving the descriptor updates to the constant IB fix that? Actually the current implementation worked better than anything else I tried. When you really need the resource descriptors in a separate buffer you need to use one buffer for each draw call and always write the full buffer contents (no partial updates). Flushing anything won't really help either.. The only solution I see using one buffer is to block until the last draw call is finished with WAIT_REG_MEM, but that would be quite disastrous for performance. 2. If my understand is correct when they are embedded the descriptors are preloaded into the caches while executing the IB, so to archive the same speed with descriptors outside of the IB you need to add additional commands to the constant IB which is new to SI and we currently doesn't support in the CS interface. There seems to be support for the constant IB. The CS ioctl chunk ID is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in si_vm_packet3_ce_check. Is there anything missing? The userspace side seems to be missing and except for throwing NOP packets into it we never tested it. I know from the closed source side that it actually was quite tricky for them to get working. Additional to that please note that I'm not 100% sure that just putting the descriptors into the IB is really helping here. It was just the most simplest solution to avoid allocating a new buffer on each draw call.. I understand. I don't really need to have resource descriptors in a separate buffer, all I need is these 3 basic features a gallium driver should support: - fine-grained resource updates (mainly for performance, see below) - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0) - no GPU crash if a shader is using SAMPLER[15] but there are no samplers bound FYI, partial sampler view and sampler state updates are coming to gallium, Brian Paul already has some patches, it's just a matter of time now. Vertex and constant buffer states already support partial updates. That shouldn't be to much off a problem. Just allocate a state at startup and initialize it with the proper pm4 commands for 16 samplers, then update the resource descriptors in that state when we change the bound textures/samplers/views/constants/whatever. All we need to do then is setting the emitted state to NULL so that it gets re-emitted in the next draw command. That would re-emit all 16 shader resources even if just one of them needs to be changed. I was trying to avoid this inefficiency. Is it really impossible to emit just one resource descriptor and keep the others unchanged? This is a basic D3D10/11 feature, for example: void ID3D11DeviceContext::VSSetShaderResources( [in] UINT StartSlot, [in] UINT NumViews, [in] ID3D11ShaderResourceView *const *ppShaderResourceViews ); If the constant engine is required to implement this interface efficiently, then I'd like to work on constant IB support. You'll need to either store them in memory or re-emit them if you store them in the IB. The CE is mainly there so that it can prime the TC in parallel with the command stream processing. Yeah indeed. The CE is just for prefetching everything into caches and doesn't really help here. The only two options I see is either fully emitting it into the command stream whenever anything changes or allocating a new buffer for the resources on each new draw call,
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
On Fri, Aug 9, 2013 at 3:48 PM, Christian König deathsim...@vodafone.de wrote: Am 09.08.2013 15:29, schrieb Marek Olšák: On Fri, Aug 9, 2013 at 10:34 AM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 21:38, schrieb Alex Deucher: On Thu, Aug 8, 2013 at 1:34 PM, Marek Olšák mar...@gmail.com wrote: On Thu, Aug 8, 2013 at 6:57 PM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 16:33, schrieb Marek Olšák: On Thu, Aug 8, 2013 at 3:09 PM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 14:38, schrieb Marek Olšák: .On Thu, Aug 8, 2013 at 9:47 AM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 02:20, schrieb Marek Olšák: FMASK is bound as a separate texture. For every texture, there can be an FMASK. Therefore a separate array of resource slots has to be added. This adds a new mechanism for emitting resource descriptors, its features are: - resource descriptors are stored in an ordinary buffer (not in a CS) Having resource descriptors outside of the CS has two problems that we need to solve first: 1. Fine grained descriptor updates doesn't work, I already tried that. The problem is that unlike previous asics descriptors are now a memory block, so no longer part of the CP context. So when we (for example) have a draw command executing and the next draw command is using new resources for a specific slot we would either block until the first draw command is finished (which is bad for performance) or change the descriptors while they are still in use (which results in VM faults). So what would the proper solution be here? Do I need to flush some caches or would moving the descriptor updates to the constant IB fix that? Actually the current implementation worked better than anything else I tried. When you really need the resource descriptors in a separate buffer you need to use one buffer for each draw call and always write the full buffer contents (no partial updates). Flushing anything won't really help either.. The only solution I see using one buffer is to block until the last draw call is finished with WAIT_REG_MEM, but that would be quite disastrous for performance. 2. If my understand is correct when they are embedded the descriptors are preloaded into the caches while executing the IB, so to archive the same speed with descriptors outside of the IB you need to add additional commands to the constant IB which is new to SI and we currently doesn't support in the CS interface. There seems to be support for the constant IB. The CS ioctl chunk ID is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in si_vm_packet3_ce_check. Is there anything missing? The userspace side seems to be missing and except for throwing NOP packets into it we never tested it. I know from the closed source side that it actually was quite tricky for them to get working. Additional to that please note that I'm not 100% sure that just putting the descriptors into the IB is really helping here. It was just the most simplest solution to avoid allocating a new buffer on each draw call.. I understand. I don't really need to have resource descriptors in a separate buffer, all I need is these 3 basic features a gallium driver should support: - fine-grained resource updates (mainly for performance, see below) - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0) - no GPU crash if a shader is using SAMPLER[15] but there are no samplers bound FYI, partial sampler view and sampler state updates are coming to gallium, Brian Paul already has some patches, it's just a matter of time now. Vertex and constant buffer states already support partial updates. That shouldn't be to much off a problem. Just allocate a state at startup and initialize it with the proper pm4 commands for 16 samplers, then update the resource descriptors in that state when we change the bound textures/samplers/views/constants/whatever. All we need to do then is setting the emitted state to NULL so that it gets re-emitted in the next draw command. That would re-emit all 16 shader resources even if just one of them needs to be changed. I was trying to avoid this inefficiency. Is it really impossible to emit just one resource descriptor and keep the others unchanged? This is a basic D3D10/11 feature, for example: void ID3D11DeviceContext::VSSetShaderResources( [in] UINT StartSlot, [in] UINT NumViews, [in] ID3D11ShaderResourceView *const *ppShaderResourceViews ); If the constant engine is required to implement this interface efficiently, then I'd like to work on constant IB support. You'll need to either store them in memory or re-emit them if you store them in the IB. The CE is mainly there so that it can prime the TC in parallel with the command stream processing. Yeah indeed. The CE is just for prefetching everything into
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
Am 08.08.2013 02:20, schrieb Marek Olšák: FMASK is bound as a separate texture. For every texture, there can be an FMASK. Therefore a separate array of resource slots has to be added. This adds a new mechanism for emitting resource descriptors, its features are: - resource descriptors are stored in an ordinary buffer (not in a CS) Having resource descriptors outside of the CS has two problems that we need to solve first: 1. Fine grained descriptor updates doesn't work, I already tried that. The problem is that unlike previous asics descriptors are now a memory block, so no longer part of the CP context. So when we (for example) have a draw command executing and the next draw command is using new resources for a specific slot we would either block until the first draw command is finished (which is bad for performance) or change the descriptors while they are still in use (which results in VM faults). 2. If my understand is correct when they are embedded the descriptors are preloaded into the caches while executing the IB, so to archive the same speed with descriptors outside of the IB you need to add additional commands to the constant IB which is new to SI and we currently doesn't support in the CS interface. Regards, Christian. - descriptors of disabled resources are set to zeros - fine-grained resource updates (it can update one resource slot while not touching the other slots) - updates are done with the WRITE_DATA packet - it implements the si_atom interface for packet emission - only used for FMASK textures right now The primary motivation for this is that FMASK textures naturally need fine-grained resource updates and I also need to query in the shader if a resource is NULL. --- src/gallium/drivers/radeonsi/Makefile.sources | 1 + src/gallium/drivers/radeonsi/r600_hw_context.c | 3 + src/gallium/drivers/radeonsi/r600_resource.h | 1 + src/gallium/drivers/radeonsi/r600_texture.c| 1 + src/gallium/drivers/radeonsi/radeonsi_pipe.c | 9 +- src/gallium/drivers/radeonsi/radeonsi_pipe.h | 6 +- src/gallium/drivers/radeonsi/radeonsi_pm4.c| 7 + src/gallium/drivers/radeonsi/radeonsi_pm4.h| 2 + src/gallium/drivers/radeonsi/si_descriptors.c | 188 + src/gallium/drivers/radeonsi/si_state.c| 58 +++- src/gallium/drivers/radeonsi/si_state.h| 36 + 11 files changed, 305 insertions(+), 7 deletions(-) create mode 100644 src/gallium/drivers/radeonsi/si_descriptors.c diff --git a/src/gallium/drivers/radeonsi/Makefile.sources b/src/gallium/drivers/radeonsi/Makefile.sources index b3ffa72..68c8282 100644 --- a/src/gallium/drivers/radeonsi/Makefile.sources +++ b/src/gallium/drivers/radeonsi/Makefile.sources @@ -10,6 +10,7 @@ C_SOURCES := \ r600_translate.c \ radeonsi_pm4.c \ radeonsi_compute.c \ + si_descriptors.c \ si_state.c \ si_state_streamout.c \ si_state_draw.c \ diff --git a/src/gallium/drivers/radeonsi/r600_hw_context.c b/src/gallium/drivers/radeonsi/r600_hw_context.c index 7ed7496..b595477 100644 --- a/src/gallium/drivers/radeonsi/r600_hw_context.c +++ b/src/gallium/drivers/radeonsi/r600_hw_context.c @@ -289,6 +289,9 @@ void si_context_flush(struct r600_context *ctx, unsigned flags) * next draw command */ si_pm4_reset_emitted(ctx); + + si_sampler_views_begin_new_cs(ctx, ctx-fmask_sampler_views[PIPE_SHADER_VERTEX]); + si_sampler_views_begin_new_cs(ctx, ctx-fmask_sampler_views[PIPE_SHADER_FRAGMENT]); } void si_context_emit_fence(struct r600_context *ctx, struct si_resource *fence_bo, unsigned offset, unsigned value) diff --git a/src/gallium/drivers/radeonsi/r600_resource.h b/src/gallium/drivers/radeonsi/r600_resource.h index e5dd36a..ab5c7b7 100644 --- a/src/gallium/drivers/radeonsi/r600_resource.h +++ b/src/gallium/drivers/radeonsi/r600_resource.h @@ -44,6 +44,7 @@ struct r600_fmask_info { unsigned offset; unsigned size; unsigned alignment; + unsigned pitch; unsigned bank_height; unsigned slice_tile_max; unsigned tile_mode_index; diff --git a/src/gallium/drivers/radeonsi/r600_texture.c b/src/gallium/drivers/radeonsi/r600_texture.c index cd3d1aa..b613564 100644 --- a/src/gallium/drivers/radeonsi/r600_texture.c +++ b/src/gallium/drivers/radeonsi/r600_texture.c @@ -463,6 +463,7 @@ static void r600_texture_get_fmask_info(struct r600_screen *rscreen, out-slice_tile_max -= 1; out-tile_mode_index = fmask.tiling_index[0]; + out-pitch = fmask.level[0].nblk_x; out-bank_height = fmask.bankh; out-alignment = MAX2(256, fmask.bo_alignment); out-size = fmask.bo_size; diff --git a/src/gallium/drivers/radeonsi/radeonsi_pipe.c b/src/gallium/drivers/radeonsi/radeonsi_pipe.c index ad955e3..3112124 100644 --- a/src/gallium/drivers/radeonsi/radeonsi_pipe.c +++
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
.On Thu, Aug 8, 2013 at 9:47 AM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 02:20, schrieb Marek Olšák: FMASK is bound as a separate texture. For every texture, there can be an FMASK. Therefore a separate array of resource slots has to be added. This adds a new mechanism for emitting resource descriptors, its features are: - resource descriptors are stored in an ordinary buffer (not in a CS) Having resource descriptors outside of the CS has two problems that we need to solve first: 1. Fine grained descriptor updates doesn't work, I already tried that. The problem is that unlike previous asics descriptors are now a memory block, so no longer part of the CP context. So when we (for example) have a draw command executing and the next draw command is using new resources for a specific slot we would either block until the first draw command is finished (which is bad for performance) or change the descriptors while they are still in use (which results in VM faults). So what would the proper solution be here? Do I need to flush some caches or would moving the descriptor updates to the constant IB fix that? 2. If my understand is correct when they are embedded the descriptors are preloaded into the caches while executing the IB, so to archive the same speed with descriptors outside of the IB you need to add additional commands to the constant IB which is new to SI and we currently doesn't support in the CS interface. There seems to be support for the constant IB. The CS ioctl chunk ID is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in si_vm_packet3_ce_check. Is there anything missing? Marek Regards, Christian. - descriptors of disabled resources are set to zeros - fine-grained resource updates (it can update one resource slot while not touching the other slots) - updates are done with the WRITE_DATA packet - it implements the si_atom interface for packet emission - only used for FMASK textures right now The primary motivation for this is that FMASK textures naturally need fine-grained resource updates and I also need to query in the shader if a resource is NULL. --- src/gallium/drivers/radeonsi/Makefile.sources | 1 + src/gallium/drivers/radeonsi/r600_hw_context.c | 3 + src/gallium/drivers/radeonsi/r600_resource.h | 1 + src/gallium/drivers/radeonsi/r600_texture.c| 1 + src/gallium/drivers/radeonsi/radeonsi_pipe.c | 9 +- src/gallium/drivers/radeonsi/radeonsi_pipe.h | 6 +- src/gallium/drivers/radeonsi/radeonsi_pm4.c| 7 + src/gallium/drivers/radeonsi/radeonsi_pm4.h| 2 + src/gallium/drivers/radeonsi/si_descriptors.c | 188 + src/gallium/drivers/radeonsi/si_state.c| 58 +++- src/gallium/drivers/radeonsi/si_state.h| 36 + 11 files changed, 305 insertions(+), 7 deletions(-) create mode 100644 src/gallium/drivers/radeonsi/si_descriptors.c diff --git a/src/gallium/drivers/radeonsi/Makefile.sources b/src/gallium/drivers/radeonsi/Makefile.sources index b3ffa72..68c8282 100644 --- a/src/gallium/drivers/radeonsi/Makefile.sources +++ b/src/gallium/drivers/radeonsi/Makefile.sources @@ -10,6 +10,7 @@ C_SOURCES := \ r600_translate.c \ radeonsi_pm4.c \ radeonsi_compute.c \ + si_descriptors.c \ si_state.c \ si_state_streamout.c \ si_state_draw.c \ diff --git a/src/gallium/drivers/radeonsi/r600_hw_context.c b/src/gallium/drivers/radeonsi/r600_hw_context.c index 7ed7496..b595477 100644 --- a/src/gallium/drivers/radeonsi/r600_hw_context.c +++ b/src/gallium/drivers/radeonsi/r600_hw_context.c @@ -289,6 +289,9 @@ void si_context_flush(struct r600_context *ctx, unsigned flags) * next draw command */ si_pm4_reset_emitted(ctx); + + si_sampler_views_begin_new_cs(ctx, ctx-fmask_sampler_views[PIPE_SHADER_VERTEX]); + si_sampler_views_begin_new_cs(ctx, ctx-fmask_sampler_views[PIPE_SHADER_FRAGMENT]); } void si_context_emit_fence(struct r600_context *ctx, struct si_resource *fence_bo, unsigned offset, unsigned value) diff --git a/src/gallium/drivers/radeonsi/r600_resource.h b/src/gallium/drivers/radeonsi/r600_resource.h index e5dd36a..ab5c7b7 100644 --- a/src/gallium/drivers/radeonsi/r600_resource.h +++ b/src/gallium/drivers/radeonsi/r600_resource.h @@ -44,6 +44,7 @@ struct r600_fmask_info { unsigned offset; unsigned size; unsigned alignment; + unsigned pitch; unsigned bank_height; unsigned slice_tile_max; unsigned tile_mode_index; diff --git a/src/gallium/drivers/radeonsi/r600_texture.c b/src/gallium/drivers/radeonsi/r600_texture.c index cd3d1aa..b613564 100644 --- a/src/gallium/drivers/radeonsi/r600_texture.c +++ b/src/gallium/drivers/radeonsi/r600_texture.c @@ -463,6 +463,7 @@ static void r600_texture_get_fmask_info(struct r600_screen *rscreen,
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
Am 08.08.2013 14:38, schrieb Marek Olšák: .On Thu, Aug 8, 2013 at 9:47 AM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 02:20, schrieb Marek Olšák: FMASK is bound as a separate texture. For every texture, there can be an FMASK. Therefore a separate array of resource slots has to be added. This adds a new mechanism for emitting resource descriptors, its features are: - resource descriptors are stored in an ordinary buffer (not in a CS) Having resource descriptors outside of the CS has two problems that we need to solve first: 1. Fine grained descriptor updates doesn't work, I already tried that. The problem is that unlike previous asics descriptors are now a memory block, so no longer part of the CP context. So when we (for example) have a draw command executing and the next draw command is using new resources for a specific slot we would either block until the first draw command is finished (which is bad for performance) or change the descriptors while they are still in use (which results in VM faults). So what would the proper solution be here? Do I need to flush some caches or would moving the descriptor updates to the constant IB fix that? Actually the current implementation worked better than anything else I tried. When you really need the resource descriptors in a separate buffer you need to use one buffer for each draw call and always write the full buffer contents (no partial updates). Flushing anything won't really help either. The only solution I see using one buffer is to block until the last draw call is finished with WAIT_REG_MEM, but that would be quite disastrous for performance. 2. If my understand is correct when they are embedded the descriptors are preloaded into the caches while executing the IB, so to archive the same speed with descriptors outside of the IB you need to add additional commands to the constant IB which is new to SI and we currently doesn't support in the CS interface. There seems to be support for the constant IB. The CS ioctl chunk ID is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in si_vm_packet3_ce_check. Is there anything missing? The userspace side seems to be missing and except for throwing NOP packets into it we never tested it. I know from the closed source side that it actually was quite tricky for them to get working. Additional to that please note that I'm not 100% sure that just putting the descriptors into the IB is really helping here. It was just the most simplest solution to avoid allocating a new buffer on each draw call. Cheers, Christian. Marek Regards, Christian. - descriptors of disabled resources are set to zeros - fine-grained resource updates (it can update one resource slot while not touching the other slots) - updates are done with the WRITE_DATA packet - it implements the si_atom interface for packet emission - only used for FMASK textures right now The primary motivation for this is that FMASK textures naturally need fine-grained resource updates and I also need to query in the shader if a resource is NULL. --- src/gallium/drivers/radeonsi/Makefile.sources | 1 + src/gallium/drivers/radeonsi/r600_hw_context.c | 3 + src/gallium/drivers/radeonsi/r600_resource.h | 1 + src/gallium/drivers/radeonsi/r600_texture.c| 1 + src/gallium/drivers/radeonsi/radeonsi_pipe.c | 9 +- src/gallium/drivers/radeonsi/radeonsi_pipe.h | 6 +- src/gallium/drivers/radeonsi/radeonsi_pm4.c| 7 + src/gallium/drivers/radeonsi/radeonsi_pm4.h| 2 + src/gallium/drivers/radeonsi/si_descriptors.c | 188 + src/gallium/drivers/radeonsi/si_state.c| 58 +++- src/gallium/drivers/radeonsi/si_state.h| 36 + 11 files changed, 305 insertions(+), 7 deletions(-) create mode 100644 src/gallium/drivers/radeonsi/si_descriptors.c diff --git a/src/gallium/drivers/radeonsi/Makefile.sources b/src/gallium/drivers/radeonsi/Makefile.sources index b3ffa72..68c8282 100644 --- a/src/gallium/drivers/radeonsi/Makefile.sources +++ b/src/gallium/drivers/radeonsi/Makefile.sources @@ -10,6 +10,7 @@ C_SOURCES := \ r600_translate.c \ radeonsi_pm4.c \ radeonsi_compute.c \ + si_descriptors.c \ si_state.c \ si_state_streamout.c \ si_state_draw.c \ diff --git a/src/gallium/drivers/radeonsi/r600_hw_context.c b/src/gallium/drivers/radeonsi/r600_hw_context.c index 7ed7496..b595477 100644 --- a/src/gallium/drivers/radeonsi/r600_hw_context.c +++ b/src/gallium/drivers/radeonsi/r600_hw_context.c @@ -289,6 +289,9 @@ void si_context_flush(struct r600_context *ctx, unsigned flags) * next draw command */ si_pm4_reset_emitted(ctx); + + si_sampler_views_begin_new_cs(ctx, ctx-fmask_sampler_views[PIPE_SHADER_VERTEX]); + si_sampler_views_begin_new_cs(ctx, ctx-fmask_sampler_views[PIPE_SHADER_FRAGMENT]); } void
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
On Thu, Aug 8, 2013 at 3:09 PM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 14:38, schrieb Marek Olšák: .On Thu, Aug 8, 2013 at 9:47 AM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 02:20, schrieb Marek Olšák: FMASK is bound as a separate texture. For every texture, there can be an FMASK. Therefore a separate array of resource slots has to be added. This adds a new mechanism for emitting resource descriptors, its features are: - resource descriptors are stored in an ordinary buffer (not in a CS) Having resource descriptors outside of the CS has two problems that we need to solve first: 1. Fine grained descriptor updates doesn't work, I already tried that. The problem is that unlike previous asics descriptors are now a memory block, so no longer part of the CP context. So when we (for example) have a draw command executing and the next draw command is using new resources for a specific slot we would either block until the first draw command is finished (which is bad for performance) or change the descriptors while they are still in use (which results in VM faults). So what would the proper solution be here? Do I need to flush some caches or would moving the descriptor updates to the constant IB fix that? Actually the current implementation worked better than anything else I tried. When you really need the resource descriptors in a separate buffer you need to use one buffer for each draw call and always write the full buffer contents (no partial updates). Flushing anything won't really help either. The only solution I see using one buffer is to block until the last draw call is finished with WAIT_REG_MEM, but that would be quite disastrous for performance. 2. If my understand is correct when they are embedded the descriptors are preloaded into the caches while executing the IB, so to archive the same speed with descriptors outside of the IB you need to add additional commands to the constant IB which is new to SI and we currently doesn't support in the CS interface. There seems to be support for the constant IB. The CS ioctl chunk ID is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in si_vm_packet3_ce_check. Is there anything missing? The userspace side seems to be missing and except for throwing NOP packets into it we never tested it. I know from the closed source side that it actually was quite tricky for them to get working. Additional to that please note that I'm not 100% sure that just putting the descriptors into the IB is really helping here. It was just the most simplest solution to avoid allocating a new buffer on each draw call. I understand. I don't really need to have resource descriptors in a separate buffer, all I need is these 3 basic features a gallium driver should support: - fine-grained resource updates (mainly for performance, see below) - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0) - no GPU crash if a shader is using SAMPLER[15] but there are no samplers bound FYI, partial sampler view and sampler state updates are coming to gallium, Brian Paul already has some patches, it's just a matter of time now. Vertex and constant buffer states already support partial updates. Marek Cheers, Christian. Marek Regards, Christian. - descriptors of disabled resources are set to zeros - fine-grained resource updates (it can update one resource slot while not touching the other slots) - updates are done with the WRITE_DATA packet - it implements the si_atom interface for packet emission - only used for FMASK textures right now The primary motivation for this is that FMASK textures naturally need fine-grained resource updates and I also need to query in the shader if a resource is NULL. --- src/gallium/drivers/radeonsi/Makefile.sources | 1 + src/gallium/drivers/radeonsi/r600_hw_context.c | 3 + src/gallium/drivers/radeonsi/r600_resource.h | 1 + src/gallium/drivers/radeonsi/r600_texture.c| 1 + src/gallium/drivers/radeonsi/radeonsi_pipe.c | 9 +- src/gallium/drivers/radeonsi/radeonsi_pipe.h | 6 +- src/gallium/drivers/radeonsi/radeonsi_pm4.c| 7 + src/gallium/drivers/radeonsi/radeonsi_pm4.h| 2 + src/gallium/drivers/radeonsi/si_descriptors.c | 188 + src/gallium/drivers/radeonsi/si_state.c| 58 +++- src/gallium/drivers/radeonsi/si_state.h| 36 + 11 files changed, 305 insertions(+), 7 deletions(-) create mode 100644 src/gallium/drivers/radeonsi/si_descriptors.c diff --git a/src/gallium/drivers/radeonsi/Makefile.sources b/src/gallium/drivers/radeonsi/Makefile.sources index b3ffa72..68c8282 100644 --- a/src/gallium/drivers/radeonsi/Makefile.sources +++ b/src/gallium/drivers/radeonsi/Makefile.sources @@ -10,6 +10,7 @@ C_SOURCES := \ r600_translate.c \ radeonsi_pm4.c \
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
Am 08.08.2013 16:33, schrieb Marek Olšák: On Thu, Aug 8, 2013 at 3:09 PM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 14:38, schrieb Marek Olšák: .On Thu, Aug 8, 2013 at 9:47 AM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 02:20, schrieb Marek Olšák: FMASK is bound as a separate texture. For every texture, there can be an FMASK. Therefore a separate array of resource slots has to be added. This adds a new mechanism for emitting resource descriptors, its features are: - resource descriptors are stored in an ordinary buffer (not in a CS) Having resource descriptors outside of the CS has two problems that we need to solve first: 1. Fine grained descriptor updates doesn't work, I already tried that. The problem is that unlike previous asics descriptors are now a memory block, so no longer part of the CP context. So when we (for example) have a draw command executing and the next draw command is using new resources for a specific slot we would either block until the first draw command is finished (which is bad for performance) or change the descriptors while they are still in use (which results in VM faults). So what would the proper solution be here? Do I need to flush some caches or would moving the descriptor updates to the constant IB fix that? Actually the current implementation worked better than anything else I tried. When you really need the resource descriptors in a separate buffer you need to use one buffer for each draw call and always write the full buffer contents (no partial updates). Flushing anything won't really help either.. The only solution I see using one buffer is to block until the last draw call is finished with WAIT_REG_MEM, but that would be quite disastrous for performance. 2. If my understand is correct when they are embedded the descriptors are preloaded into the caches while executing the IB, so to archive the same speed with descriptors outside of the IB you need to add additional commands to the constant IB which is new to SI and we currently doesn't support in the CS interface. There seems to be support for the constant IB. The CS ioctl chunk ID is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in si_vm_packet3_ce_check. Is there anything missing? The userspace side seems to be missing and except for throwing NOP packets into it we never tested it. I know from the closed source side that it actually was quite tricky for them to get working. Additional to that please note that I'm not 100% sure that just putting the descriptors into the IB is really helping here. It was just the most simplest solution to avoid allocating a new buffer on each draw call. I understand. I don't really need to have resource descriptors in a separate buffer, all I need is these 3 basic features a gallium driver should support: - fine-grained resource updates (mainly for performance, see below) - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0) - no GPU crash if a shader is using SAMPLER[15] but there are no samplers bound FYI, partial sampler view and sampler state updates are coming to gallium, Brian Paul already has some patches, it's just a matter of time now. Vertex and constant buffer states already support partial updates. That shouldn't be to much off a problem. Just allocate a state at startup and initialize it with the proper pm4 commands for 16 samplers, then update the resource descriptors in that state when we change the bound textures/samplers/views/constants/whatever. All we need to do then is setting the emitted state to NULL so that it gets re-emitted in the next draw command. Christian. Marek Cheers, Christian. Marek Regards, Christian. - descriptors of disabled resources are set to zeros - fine-grained resource updates (it can update one resource slot while not touching the other slots) - updates are done with the WRITE_DATA packet - it implements the si_atom interface for packet emission - only used for FMASK textures right now The primary motivation for this is that FMASK textures naturally need fine-grained resource updates and I also need to query in the shader if a resource is NULL. --- src/gallium/drivers/radeonsi/Makefile.sources | 1 + src/gallium/drivers/radeonsi/r600_hw_context.c | 3 + src/gallium/drivers/radeonsi/r600_resource.h | 1 + src/gallium/drivers/radeonsi/r600_texture.c| 1 + src/gallium/drivers/radeonsi/radeonsi_pipe.c | 9 +- src/gallium/drivers/radeonsi/radeonsi_pipe.h | 6 +- src/gallium/drivers/radeonsi/radeonsi_pm4.c| 7 + src/gallium/drivers/radeonsi/radeonsi_pm4.h| 2 + src/gallium/drivers/radeonsi/si_descriptors.c | 188 + src/gallium/drivers/radeonsi/si_state.c| 58 +++- src/gallium/drivers/radeonsi/si_state.h| 36 + 11 files changed, 305 insertions(+), 7 deletions(-) create mode 100644
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
On Thu, Aug 8, 2013 at 6:57 PM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 16:33, schrieb Marek Olšák: On Thu, Aug 8, 2013 at 3:09 PM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 14:38, schrieb Marek Olšák: .On Thu, Aug 8, 2013 at 9:47 AM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 02:20, schrieb Marek Olšák: FMASK is bound as a separate texture. For every texture, there can be an FMASK. Therefore a separate array of resource slots has to be added. This adds a new mechanism for emitting resource descriptors, its features are: - resource descriptors are stored in an ordinary buffer (not in a CS) Having resource descriptors outside of the CS has two problems that we need to solve first: 1. Fine grained descriptor updates doesn't work, I already tried that. The problem is that unlike previous asics descriptors are now a memory block, so no longer part of the CP context. So when we (for example) have a draw command executing and the next draw command is using new resources for a specific slot we would either block until the first draw command is finished (which is bad for performance) or change the descriptors while they are still in use (which results in VM faults). So what would the proper solution be here? Do I need to flush some caches or would moving the descriptor updates to the constant IB fix that? Actually the current implementation worked better than anything else I tried. When you really need the resource descriptors in a separate buffer you need to use one buffer for each draw call and always write the full buffer contents (no partial updates). Flushing anything won't really help either.. The only solution I see using one buffer is to block until the last draw call is finished with WAIT_REG_MEM, but that would be quite disastrous for performance. 2. If my understand is correct when they are embedded the descriptors are preloaded into the caches while executing the IB, so to archive the same speed with descriptors outside of the IB you need to add additional commands to the constant IB which is new to SI and we currently doesn't support in the CS interface. There seems to be support for the constant IB. The CS ioctl chunk ID is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in si_vm_packet3_ce_check. Is there anything missing? The userspace side seems to be missing and except for throwing NOP packets into it we never tested it. I know from the closed source side that it actually was quite tricky for them to get working. Additional to that please note that I'm not 100% sure that just putting the descriptors into the IB is really helping here. It was just the most simplest solution to avoid allocating a new buffer on each draw call. I understand. I don't really need to have resource descriptors in a separate buffer, all I need is these 3 basic features a gallium driver should support: - fine-grained resource updates (mainly for performance, see below) - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0) - no GPU crash if a shader is using SAMPLER[15] but there are no samplers bound FYI, partial sampler view and sampler state updates are coming to gallium, Brian Paul already has some patches, it's just a matter of time now. Vertex and constant buffer states already support partial updates. That shouldn't be to much off a problem. Just allocate a state at startup and initialize it with the proper pm4 commands for 16 samplers, then update the resource descriptors in that state when we change the bound textures/samplers/views/constants/whatever. All we need to do then is setting the emitted state to NULL so that it gets re-emitted in the next draw command. That would re-emit all 16 shader resources even if just one of them needs to be changed. I was trying to avoid this inefficiency. Is it really impossible to emit just one resource descriptor and keep the others unchanged? This is a basic D3D10/11 feature, for example: void ID3D11DeviceContext::VSSetShaderResources( [in] UINT StartSlot, [in] UINT NumViews, [in] ID3D11ShaderResourceView *const *ppShaderResourceViews ); If the constant engine is required to implement this interface efficiently, then I'd like to work on constant IB support. Marek ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup
On Thu, Aug 8, 2013 at 1:34 PM, Marek Olšák mar...@gmail.com wrote: On Thu, Aug 8, 2013 at 6:57 PM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 16:33, schrieb Marek Olšák: On Thu, Aug 8, 2013 at 3:09 PM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 14:38, schrieb Marek Olšák: .On Thu, Aug 8, 2013 at 9:47 AM, Christian König deathsim...@vodafone.de wrote: Am 08.08.2013 02:20, schrieb Marek Olšák: FMASK is bound as a separate texture. For every texture, there can be an FMASK. Therefore a separate array of resource slots has to be added. This adds a new mechanism for emitting resource descriptors, its features are: - resource descriptors are stored in an ordinary buffer (not in a CS) Having resource descriptors outside of the CS has two problems that we need to solve first: 1. Fine grained descriptor updates doesn't work, I already tried that. The problem is that unlike previous asics descriptors are now a memory block, so no longer part of the CP context. So when we (for example) have a draw command executing and the next draw command is using new resources for a specific slot we would either block until the first draw command is finished (which is bad for performance) or change the descriptors while they are still in use (which results in VM faults). So what would the proper solution be here? Do I need to flush some caches or would moving the descriptor updates to the constant IB fix that? Actually the current implementation worked better than anything else I tried. When you really need the resource descriptors in a separate buffer you need to use one buffer for each draw call and always write the full buffer contents (no partial updates). Flushing anything won't really help either.. The only solution I see using one buffer is to block until the last draw call is finished with WAIT_REG_MEM, but that would be quite disastrous for performance. 2. If my understand is correct when they are embedded the descriptors are preloaded into the caches while executing the IB, so to archive the same speed with descriptors outside of the IB you need to add additional commands to the constant IB which is new to SI and we currently doesn't support in the CS interface. There seems to be support for the constant IB. The CS ioctl chunk ID is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in si_vm_packet3_ce_check. Is there anything missing? The userspace side seems to be missing and except for throwing NOP packets into it we never tested it. I know from the closed source side that it actually was quite tricky for them to get working. Additional to that please note that I'm not 100% sure that just putting the descriptors into the IB is really helping here. It was just the most simplest solution to avoid allocating a new buffer on each draw call. I understand. I don't really need to have resource descriptors in a separate buffer, all I need is these 3 basic features a gallium driver should support: - fine-grained resource updates (mainly for performance, see below) - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0) - no GPU crash if a shader is using SAMPLER[15] but there are no samplers bound FYI, partial sampler view and sampler state updates are coming to gallium, Brian Paul already has some patches, it's just a matter of time now. Vertex and constant buffer states already support partial updates. That shouldn't be to much off a problem. Just allocate a state at startup and initialize it with the proper pm4 commands for 16 samplers, then update the resource descriptors in that state when we change the bound textures/samplers/views/constants/whatever. All we need to do then is setting the emitted state to NULL so that it gets re-emitted in the next draw command. That would re-emit all 16 shader resources even if just one of them needs to be changed. I was trying to avoid this inefficiency. Is it really impossible to emit just one resource descriptor and keep the others unchanged? This is a basic D3D10/11 feature, for example: void ID3D11DeviceContext::VSSetShaderResources( [in] UINT StartSlot, [in] UINT NumViews, [in] ID3D11ShaderResourceView *const *ppShaderResourceViews ); If the constant engine is required to implement this interface efficiently, then I'd like to work on constant IB support. You'll need to either store them in memory or re-emit them if you store them in the IB. The CE is mainly there so that it can prime the TC in parallel with the command stream processing. Alex ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev