Re: [Mesa-dev] [PATCH 3/9] i965/fs: Switch to the constant cache for uniform pull constants.

2016-12-13 Thread Kenneth Graunke
On Friday, December 9, 2016 11:03:26 AM PST Francisco Jerez wrote:
> -/* We have to use a message header on Skylake to get SIMD4x2
> - * mode.  Reserve space for the register.
> -*/

This should reduce the message length of the SENDs, right?  It might be
worth adding some shader-db statistics showing how much spilling is
reduced because of this.

Not a big deal either way - you have real performance data :)


signature.asc
Description: This is a digitally signed message part.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


[Mesa-dev] [PATCH 3/9] i965/fs: Switch to the constant cache for uniform pull constants.

2016-12-09 Thread Francisco Jerez
This reverts to using the oword block read messages for uniform pull
constant loads, as used to be the case until
4c1fdae0a01b3f92ec03b61aac1d3df5.  There are two important differences
though: Now the L3 cacheability bits are set up correctly for UBOs
(since 11f5d8a5d4fbb861ec161f68593e429cbd65d1cd), and we target the
constant cache instead of the data cache.  The latter used to get no
L3 way allocation on boot on all platforms that existed at the time,
so oword read messages wouldn't get cached on L3 regardless of the
MOCS bits, what probably explains the apparent slowness of oword
fetches.

Constant cache loads seem to perform better than SIMD4x2 sampler loads
in a number of cases, they alleviate some of the cache thrashing
caused by the competition with textures for the L1/L2 sampler caches,
and they allow fetching up to 128B worth of constants with a single
oword fetch message.

Note that IVB devices suffer from a hardware bug that leads to
serialization of L3 read requests overlapping the same cacheline as
result of a (on IVB buggy) mechanism of the L3 to preserve coherency.
Since read requests for matching cachelines from any L3 client are not
pipelined, throughput may decrease in cases where there are no
non-overlapping requests left in the queue that can be processed
between them.

This situation should be relatively uncommon as long as we make sure
that we don't use the 1/2 oword messages in cases where the shader
intends to read from any other location of the same cacheline at some
other point.  This is generally a good idea anyway on all generations
because using the 1 and 2 oword messages is expected to waste
bandwidth since the minimum L3 request size for the DC is exactly 4
owords (i.e. one cacheline).  A future commit will have this effect.
I haven't been able to find any real-world example where this would
still result in a regression on IVB, but if someone happens to find
one it shouldn't be too difficult to add an IVB-specific check to have
it fall back to the sampler cache for pull constant loads.

v3: Non-trivial rebase.
---
 src/mesa/drivers/dri/i965/brw_eu_emit.c|  5 +-
 src/mesa/drivers/dri/i965/brw_fs.cpp   | 42 +++---
 src/mesa/drivers/dri/i965/brw_fs.h |  2 +-
 src/mesa/drivers/dri/i965/brw_fs_generator.cpp | 78 +-
 4 files changed, 36 insertions(+), 91 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/brw_eu_emit.c 
b/src/mesa/drivers/dri/i965/brw_eu_emit.c
index 72b6df6..341f543 100644
--- a/src/mesa/drivers/dri/i965/brw_eu_emit.c
+++ b/src/mesa/drivers/dri/i965/brw_eu_emit.c
@@ -2266,7 +2266,7 @@ gen7_block_read_scratch(struct brw_codegen *p,
 }
 
 /**
- * Read a float[4] vector from the data port Data Cache (const buffer).
+ * Read a float[4] vector from the data port constant cache.
  * Location (in buffer) should be a multiple of 16.
  * Used for fetching shader constants.
  */
@@ -2278,8 +2278,7 @@ void brw_oword_block_read(struct brw_codegen *p,
 {
const struct gen_device_info *devinfo = p->devinfo;
const unsigned target_cache =
-  (devinfo->gen >= 7 ? GEN7_SFID_DATAPORT_DATA_CACHE :
-   devinfo->gen >= 6 ? GEN6_SFID_DATAPORT_SAMPLER_CACHE :
+  (devinfo->gen >= 6 ? GEN6_SFID_DATAPORT_CONSTANT_CACHE :
BRW_DATAPORT_READ_TARGET_DATA_CACHE);
 
/* On newer hardware, offset is in units of owords. */
diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp 
b/src/mesa/drivers/dri/i965/brw_fs.cpp
index b5d1381..819d256 100644
--- a/src/mesa/drivers/dri/i965/brw_fs.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
@@ -3202,44 +3202,18 @@ fs_visitor::lower_uniform_pull_constant_loads()
  continue;
 
   if (devinfo->gen >= 7) {
- /* The offset arg is a vec4-aligned immediate byte offset. */
- fs_reg const_offset_reg = inst->src[1];
- assert(const_offset_reg.file == IMM &&
-const_offset_reg.type == BRW_REGISTER_TYPE_UD);
- assert(const_offset_reg.ud % 16 == 0);
-
- fs_reg payload, offset;
- if (devinfo->gen >= 9) {
-/* We have to use a message header on Skylake to get SIMD4x2
- * mode.  Reserve space for the register.
-*/
-offset = payload = fs_reg(VGRF, alloc.allocate(2));
-offset.offset += REG_SIZE;
-inst->mlen = 2;
- } else {
-offset = payload = fs_reg(VGRF, alloc.allocate(1));
-inst->mlen = 1;
- }
-
- /* This is actually going to be a MOV, but since only the first dword
-  * is accessed, we have a special opcode to do just that one.  Note
-  * that this needs to be an operation that will be considered a def
-  * by live variable analysis, or register allocation will explode.
-  */
- fs_inst *setup = new(mem_ctx) fs_inst(FS_OPCODE_SET_SIMD4X2_OFFSET,
-   8, offset, const_offset_reg);
- setup->force_writemask_all =