Re: [Mesa-dev] [PATCH 4/4] i965: Implement dual-patch tessellation evaluation shaders.
Hi, Could you also give dual & single shaders different names when dumping them, e.g. something like the attached patch? - Eero On 23.02.2018 10:36, Kenneth Graunke wrote: Normally, SIMD8 tessellation evaluation shaders operate on a single patch, with each channel operating on a different vertex within the patch. For patch primitives with fewer than 8 vertices, this means that some of the channels are disabled, effectively wasting compute power. This is a fairly common case. To solve this, Skylake and later add "dual-patch" domain shader support. Dual-patch mode only supports PATCHLIST_1..4. The first four channels process vertices in the first patch, while the second four channels process vertices from a second patch. This can double the throughput. Similar to pixel shader SIMD8 vs. SIMD16 handling, 3DSTATE_DS accepts two KSPs - one for single-patch mode, and one for dual-patch mode. The hardware will dynamically dispatch whichever kernel makes sense. The two shader modes are nearly identical. The three differences are: - There are two URB handles instead of one. - Dual-patch has an extra payload register (g1) for PrimitiveID. (single-patch packs it in g0, but they couldn't fit two IDs there) - Pushed input data arrives interleaved rather than packed (in SIMD4x2 style rather than SIMD4x1 style). For example, varyings 'a' and 'b' would show up as follows: Single patch (SIMD4x1 style inputs): g6: | b1.x | b1.y | b1.z | b1.w | a1.x | a1.y | a1.z | a1.w | Dual patch (SIMD4x2 style inputs): g6: | a2.x | a2.y | a2.z | a2.w | a1.x | a1.y | a1.z | a1.w | g7: | b2.x | b2.y | b2.z | b2.w | b1.x | b1.y | b1.z | b1.w | This is fairly easy to adjust for - in fact, we can take the FS IR we already generated for single-patch mode and transform it to create a dual-patch shader. We only need to repeat register allocation. Our load_input handling for TES shaders always emits MOVs to expand a scalar input to a full SIMD8 register. While these may be copy propagated away, it does mean all input file access will be a scalar <0,1,0> region. (The FS backend in theory could recognize things like dot(input1, input2) and emit a vector DP4 operation, but it does not do such things today, nor is it likely to gain such support.) Likewise, our URB access reads a single 32-bit URB handle from the payload, expanding it to 8 handles for the SIMD8 URB messages. We can adjust the region from <0,1,0> to <1,4,0> on that MOV to replicate the first four times, and the second another four times. Despite having a register for PrimitiveID, the documentation says dual-patch mode is not allowed when PrimitiveID is in use. So we don't need to handle that. This should be all that's required. Improves performance in Synmark's Gl40TerrainFlyTess at 1024x768 on Apollolake 3x6 with single-channel RAM by 0.161727% +/- 0.0686365% (n=1062). --- src/intel/compiler/brw_compiler.h | 2 + src/intel/compiler/brw_fs.cpp | 98 ++- src/intel/compiler/brw_fs.h | 2 + src/intel/compiler/brw_fs_visitor.cpp | 1 + src/intel/compiler/brw_shader.cpp | 3 + src/mesa/drivers/dri/i965/genX_state_upload.c | 7 ++ 6 files changed, 112 insertions(+), 1 deletion(-) This is not that impressive in itself, but it seems like it can only help...the hardware ought to automatically dispatch in dual mode if half the channels in an invocation were going to be wasted. diff --git a/src/intel/compiler/brw_compiler.h b/src/intel/compiler/brw_compiler.h index b1086bbcee5..311cff5a33d 100644 --- a/src/intel/compiler/brw_compiler.h +++ b/src/intel/compiler/brw_compiler.h @@ -986,6 +986,8 @@ struct brw_tes_prog_data enum brw_tess_partitioning partitioning; enum brw_tess_output_topology output_topology; enum brw_tess_domain domain; + + uint32_t prog_offset_dual_patch; }; struct brw_gs_prog_data diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp index bed632d21b9..02383280932 100644 --- a/src/intel/compiler/brw_fs.cpp +++ b/src/intel/compiler/brw_fs.cpp @@ -6278,6 +6278,16 @@ fs_visitor::run_tcs_single_patch() return !failed; } +static void +copy_fs_inst_list(void *mem_ctx, exec_list *dst, exec_list *src) +{ + dst->make_empty(); + + foreach_in_list(fs_inst, inst, src) { + dst->push_tail(new(mem_ctx) fs_inst(*inst)); + } +} + bool fs_visitor::run_tes() { @@ -6307,9 +6317,95 @@ fs_visitor::run_tes() assign_tes_urb_setup(); fixup_3src_null_dest(); + + /* The Skylake 3DSTATE_DS documentation says: +* +*"SIMD8_SINGLE_OR_DUAL_PATCH must not be used if the domain shader +* kernel uses primitive id." +* +* So, don't bother compiling a dual-patch shader in that case. +*/ + cfg_t *single_patch_cfg = cfg; + if (devinfo->gen >= 9 && + (nir->info.system_values_read & SYSTEM_BIT_PRIMITIVE_ID) == 0 && + !(IN
[Mesa-dev] [PATCH 4/4] i965: Implement dual-patch tessellation evaluation shaders.
Normally, SIMD8 tessellation evaluation shaders operate on a single patch, with each channel operating on a different vertex within the patch. For patch primitives with fewer than 8 vertices, this means that some of the channels are disabled, effectively wasting compute power. This is a fairly common case. To solve this, Skylake and later add "dual-patch" domain shader support. Dual-patch mode only supports PATCHLIST_1..4. The first four channels process vertices in the first patch, while the second four channels process vertices from a second patch. This can double the throughput. Similar to pixel shader SIMD8 vs. SIMD16 handling, 3DSTATE_DS accepts two KSPs - one for single-patch mode, and one for dual-patch mode. The hardware will dynamically dispatch whichever kernel makes sense. The two shader modes are nearly identical. The three differences are: - There are two URB handles instead of one. - Dual-patch has an extra payload register (g1) for PrimitiveID. (single-patch packs it in g0, but they couldn't fit two IDs there) - Pushed input data arrives interleaved rather than packed (in SIMD4x2 style rather than SIMD4x1 style). For example, varyings 'a' and 'b' would show up as follows: Single patch (SIMD4x1 style inputs): g6: | b1.x | b1.y | b1.z | b1.w | a1.x | a1.y | a1.z | a1.w | Dual patch (SIMD4x2 style inputs): g6: | a2.x | a2.y | a2.z | a2.w | a1.x | a1.y | a1.z | a1.w | g7: | b2.x | b2.y | b2.z | b2.w | b1.x | b1.y | b1.z | b1.w | This is fairly easy to adjust for - in fact, we can take the FS IR we already generated for single-patch mode and transform it to create a dual-patch shader. We only need to repeat register allocation. Our load_input handling for TES shaders always emits MOVs to expand a scalar input to a full SIMD8 register. While these may be copy propagated away, it does mean all input file access will be a scalar <0,1,0> region. (The FS backend in theory could recognize things like dot(input1, input2) and emit a vector DP4 operation, but it does not do such things today, nor is it likely to gain such support.) Likewise, our URB access reads a single 32-bit URB handle from the payload, expanding it to 8 handles for the SIMD8 URB messages. We can adjust the region from <0,1,0> to <1,4,0> on that MOV to replicate the first four times, and the second another four times. Despite having a register for PrimitiveID, the documentation says dual-patch mode is not allowed when PrimitiveID is in use. So we don't need to handle that. This should be all that's required. Improves performance in Synmark's Gl40TerrainFlyTess at 1024x768 on Apollolake 3x6 with single-channel RAM by 0.161727% +/- 0.0686365% (n=1062). --- src/intel/compiler/brw_compiler.h | 2 + src/intel/compiler/brw_fs.cpp | 98 ++- src/intel/compiler/brw_fs.h | 2 + src/intel/compiler/brw_fs_visitor.cpp | 1 + src/intel/compiler/brw_shader.cpp | 3 + src/mesa/drivers/dri/i965/genX_state_upload.c | 7 ++ 6 files changed, 112 insertions(+), 1 deletion(-) This is not that impressive in itself, but it seems like it can only help...the hardware ought to automatically dispatch in dual mode if half the channels in an invocation were going to be wasted. diff --git a/src/intel/compiler/brw_compiler.h b/src/intel/compiler/brw_compiler.h index b1086bbcee5..311cff5a33d 100644 --- a/src/intel/compiler/brw_compiler.h +++ b/src/intel/compiler/brw_compiler.h @@ -986,6 +986,8 @@ struct brw_tes_prog_data enum brw_tess_partitioning partitioning; enum brw_tess_output_topology output_topology; enum brw_tess_domain domain; + + uint32_t prog_offset_dual_patch; }; struct brw_gs_prog_data diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp index bed632d21b9..02383280932 100644 --- a/src/intel/compiler/brw_fs.cpp +++ b/src/intel/compiler/brw_fs.cpp @@ -6278,6 +6278,16 @@ fs_visitor::run_tcs_single_patch() return !failed; } +static void +copy_fs_inst_list(void *mem_ctx, exec_list *dst, exec_list *src) +{ + dst->make_empty(); + + foreach_in_list(fs_inst, inst, src) { + dst->push_tail(new(mem_ctx) fs_inst(*inst)); + } +} + bool fs_visitor::run_tes() { @@ -6307,9 +6317,95 @@ fs_visitor::run_tes() assign_tes_urb_setup(); fixup_3src_null_dest(); + + /* The Skylake 3DSTATE_DS documentation says: +* +*"SIMD8_SINGLE_OR_DUAL_PATCH must not be used if the domain shader +* kernel uses primitive id." +* +* So, don't bother compiling a dual-patch shader in that case. +*/ + cfg_t *single_patch_cfg = cfg; + if (devinfo->gen >= 9 && + (nir->info.system_values_read & SYSTEM_BIT_PRIMITIVE_ID) == 0 && + !(INTEL_DEBUG & DEBUG_NO_DUAL_OBJECT_GS)) { + dual_patch_cfg = new(mem_ctx) cfg_t(*cfg, copy_fs_inst_list); + } + allocate_registers(8, true); - return !failed; + if (failed) + return false; + + if (