subject:"\[Mesa\-dev\] \[PATCH 4\/4\] i965\: Implement dual\-patch tessellation evaluation shaders."

Re: [Mesa-dev] [PATCH 4/4] i965: Implement dual-patch tessellation evaluation shaders.

2018-02-23 Thread Eero Tamminen


Hi,

Could you also give dual & single shaders different names when dumping
them, e.g. something like the attached patch?


- Eero

On 23.02.2018 10:36, Kenneth Graunke wrote:

Normally, SIMD8 tessellation evaluation shaders operate on a single
patch, with each channel operating on a different vertex within the
patch.  For patch primitives with fewer than 8 vertices, this means
that some of the channels are disabled, effectively wasting compute
power.  This is a fairly common case.

To solve this, Skylake and later add "dual-patch" domain shader support.
Dual-patch mode only supports PATCHLIST_1..4.  The first four channels
process vertices in the first patch, while the second four channels
process vertices from a second patch.  This can double the throughput.

Similar to pixel shader SIMD8 vs. SIMD16 handling, 3DSTATE_DS accepts
two KSPs - one for single-patch mode, and one for dual-patch mode.  The
hardware will dynamically dispatch whichever kernel makes sense.

The two shader modes are nearly identical.  The three differences are:

- There are two URB handles instead of one.
- Dual-patch has an extra payload register (g1) for PrimitiveID.
   (single-patch packs it in g0, but they couldn't fit two IDs there)
- Pushed input data arrives interleaved rather than packed (in SIMD4x2
   style rather than SIMD4x1 style).  For example, varyings 'a' and 'b'
   would show up as follows:

   Single patch (SIMD4x1 style inputs):
   g6: | b1.x | b1.y | b1.z | b1.w | a1.x | a1.y | a1.z | a1.w |

   Dual patch (SIMD4x2 style inputs):
   g6: | a2.x | a2.y | a2.z | a2.w | a1.x | a1.y | a1.z | a1.w |
   g7: | b2.x | b2.y | b2.z | b2.w | b1.x | b1.y | b1.z | b1.w |

This is fairly easy to adjust for - in fact, we can take the FS IR
we already generated for single-patch mode and transform it to
create a dual-patch shader.  We only need to repeat register allocation.

Our load_input handling for TES shaders always emits MOVs to expand
a scalar input to a full SIMD8 register.  While these may be copy
propagated away, it does mean all input file access will be a scalar
<0,1,0> region.  (The FS backend in theory could recognize things
like dot(input1, input2) and emit a vector DP4 operation, but it does
not do such things today, nor is it likely to gain such support.)

Likewise, our URB access reads a single 32-bit URB handle from the
payload, expanding it to 8 handles for the SIMD8 URB messages.  We
can adjust the region from <0,1,0> to <1,4,0> on that MOV to replicate
the first four times, and the second another four times.

Despite having a register for PrimitiveID, the documentation says
dual-patch mode is not allowed when PrimitiveID is in use.  So we
don't need to handle that.

This should be all that's required.

Improves performance in Synmark's Gl40TerrainFlyTess at 1024x768 on
Apollolake 3x6 with single-channel RAM by 0.161727% +/- 0.0686365%
(n=1062).
---
  src/intel/compiler/brw_compiler.h |  2 +
  src/intel/compiler/brw_fs.cpp | 98 ++-
  src/intel/compiler/brw_fs.h   |  2 +
  src/intel/compiler/brw_fs_visitor.cpp |  1 +
  src/intel/compiler/brw_shader.cpp |  3 +
  src/mesa/drivers/dri/i965/genX_state_upload.c |  7 ++
  6 files changed, 112 insertions(+), 1 deletion(-)

This is not that impressive in itself, but it seems like it can only
help...the hardware ought to automatically dispatch in dual mode if
half the channels in an invocation were going to be wasted.

diff --git a/src/intel/compiler/brw_compiler.h 
b/src/intel/compiler/brw_compiler.h
index b1086bbcee5..311cff5a33d 100644
--- a/src/intel/compiler/brw_compiler.h
+++ b/src/intel/compiler/brw_compiler.h
@@ -986,6 +986,8 @@ struct brw_tes_prog_data
 enum brw_tess_partitioning partitioning;
 enum brw_tess_output_topology output_topology;
 enum brw_tess_domain domain;
+
+   uint32_t prog_offset_dual_patch;
  };
  
  struct brw_gs_prog_data

diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
index bed632d21b9..02383280932 100644
--- a/src/intel/compiler/brw_fs.cpp
+++ b/src/intel/compiler/brw_fs.cpp
@@ -6278,6 +6278,16 @@ fs_visitor::run_tcs_single_patch()
 return !failed;
  }
  
+static void

+copy_fs_inst_list(void *mem_ctx, exec_list *dst, exec_list *src)
+{
+   dst->make_empty();
+
+   foreach_in_list(fs_inst, inst, src) {
+  dst->push_tail(new(mem_ctx) fs_inst(*inst));
+   }
+}
+
  bool
  fs_visitor::run_tes()
  {
@@ -6307,9 +6317,95 @@ fs_visitor::run_tes()
 assign_tes_urb_setup();
  
 fixup_3src_null_dest();

+
+   /* The Skylake 3DSTATE_DS documentation says:
+*
+*"SIMD8_SINGLE_OR_DUAL_PATCH must not be used if the domain shader
+* kernel uses primitive id."
+*
+* So, don't bother compiling a dual-patch shader in that case.
+*/
+   cfg_t *single_patch_cfg = cfg;
+   if (devinfo->gen >= 9 &&
+   (nir->info.system_values_read & SYSTEM_BIT_PRIMITIVE_ID) == 0 &&
+   !(IN

[Mesa-dev] [PATCH 4/4] i965: Implement dual-patch tessellation evaluation shaders.

2018-02-23 Thread Kenneth Graunke

Normally, SIMD8 tessellation evaluation shaders operate on a single
patch, with each channel operating on a different vertex within the
patch.  For patch primitives with fewer than 8 vertices, this means
that some of the channels are disabled, effectively wasting compute
power.  This is a fairly common case.

To solve this, Skylake and later add "dual-patch" domain shader support.
Dual-patch mode only supports PATCHLIST_1..4.  The first four channels
process vertices in the first patch, while the second four channels
process vertices from a second patch.  This can double the throughput.

Similar to pixel shader SIMD8 vs. SIMD16 handling, 3DSTATE_DS accepts
two KSPs - one for single-patch mode, and one for dual-patch mode.  The
hardware will dynamically dispatch whichever kernel makes sense.

The two shader modes are nearly identical.  The three differences are:

- There are two URB handles instead of one.
- Dual-patch has an extra payload register (g1) for PrimitiveID.
  (single-patch packs it in g0, but they couldn't fit two IDs there)
- Pushed input data arrives interleaved rather than packed (in SIMD4x2
  style rather than SIMD4x1 style).  For example, varyings 'a' and 'b'
  would show up as follows:

  Single patch (SIMD4x1 style inputs):
  g6: | b1.x | b1.y | b1.z | b1.w | a1.x | a1.y | a1.z | a1.w |

  Dual patch (SIMD4x2 style inputs):
  g6: | a2.x | a2.y | a2.z | a2.w | a1.x | a1.y | a1.z | a1.w |
  g7: | b2.x | b2.y | b2.z | b2.w | b1.x | b1.y | b1.z | b1.w |

This is fairly easy to adjust for - in fact, we can take the FS IR
we already generated for single-patch mode and transform it to
create a dual-patch shader.  We only need to repeat register allocation.

Our load_input handling for TES shaders always emits MOVs to expand
a scalar input to a full SIMD8 register.  While these may be copy
propagated away, it does mean all input file access will be a scalar
<0,1,0> region.  (The FS backend in theory could recognize things
like dot(input1, input2) and emit a vector DP4 operation, but it does
not do such things today, nor is it likely to gain such support.)

Likewise, our URB access reads a single 32-bit URB handle from the
payload, expanding it to 8 handles for the SIMD8 URB messages.  We
can adjust the region from <0,1,0> to <1,4,0> on that MOV to replicate
the first four times, and the second another four times.

Despite having a register for PrimitiveID, the documentation says
dual-patch mode is not allowed when PrimitiveID is in use.  So we
don't need to handle that.

This should be all that's required.

Improves performance in Synmark's Gl40TerrainFlyTess at 1024x768 on
Apollolake 3x6 with single-channel RAM by 0.161727% +/- 0.0686365%
(n=1062).
---
 src/intel/compiler/brw_compiler.h |  2 +
 src/intel/compiler/brw_fs.cpp | 98 ++-
 src/intel/compiler/brw_fs.h   |  2 +
 src/intel/compiler/brw_fs_visitor.cpp |  1 +
 src/intel/compiler/brw_shader.cpp |  3 +
 src/mesa/drivers/dri/i965/genX_state_upload.c |  7 ++
 6 files changed, 112 insertions(+), 1 deletion(-)

This is not that impressive in itself, but it seems like it can only
help...the hardware ought to automatically dispatch in dual mode if
half the channels in an invocation were going to be wasted.

diff --git a/src/intel/compiler/brw_compiler.h 
b/src/intel/compiler/brw_compiler.h
index b1086bbcee5..311cff5a33d 100644
--- a/src/intel/compiler/brw_compiler.h
+++ b/src/intel/compiler/brw_compiler.h
@@ -986,6 +986,8 @@ struct brw_tes_prog_data
enum brw_tess_partitioning partitioning;
enum brw_tess_output_topology output_topology;
enum brw_tess_domain domain;
+
+   uint32_t prog_offset_dual_patch;
 };
 
 struct brw_gs_prog_data
diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
index bed632d21b9..02383280932 100644
--- a/src/intel/compiler/brw_fs.cpp
+++ b/src/intel/compiler/brw_fs.cpp
@@ -6278,6 +6278,16 @@ fs_visitor::run_tcs_single_patch()
return !failed;
 }
 
+static void
+copy_fs_inst_list(void *mem_ctx, exec_list *dst, exec_list *src)
+{
+   dst->make_empty();
+
+   foreach_in_list(fs_inst, inst, src) {
+  dst->push_tail(new(mem_ctx) fs_inst(*inst));
+   }
+}
+
 bool
 fs_visitor::run_tes()
 {
@@ -6307,9 +6317,95 @@ fs_visitor::run_tes()
assign_tes_urb_setup();
 
fixup_3src_null_dest();
+
+   /* The Skylake 3DSTATE_DS documentation says:
+*
+*"SIMD8_SINGLE_OR_DUAL_PATCH must not be used if the domain shader
+* kernel uses primitive id."
+*
+* So, don't bother compiling a dual-patch shader in that case.
+*/
+   cfg_t *single_patch_cfg = cfg;
+   if (devinfo->gen >= 9 &&
+   (nir->info.system_values_read & SYSTEM_BIT_PRIMITIVE_ID) == 0 &&
+   !(INTEL_DEBUG & DEBUG_NO_DUAL_OBJECT_GS)) {
+  dual_patch_cfg = new(mem_ctx) cfg_t(*cfg, copy_fs_inst_list);
+   }
+
allocate_registers(8, true);
 
-   return !failed;
+   if (failed)
+  return false;
+
+   if (

Re: [Mesa-dev] [PATCH 4/4] i965: Implement dual-patch tessellation evaluation shaders.

[Mesa-dev] [PATCH 4/4] i965: Implement dual-patch tessellation evaluation shaders.

2 matches

Site Navigation

Mail list logo

Footer information