Re: [Mesa-dev] [PATCH 6/9] i965/fs: Fetch one cacheline of pull constants at a time.

2016-12-14 Thread Francisco Jerez
Kenneth Graunke  writes:

> On Wednesday, December 14, 2016 2:18:16 PM PST Francisco Jerez wrote:
>> Francisco Jerez  writes:
>> 
>> > Kenneth Graunke  writes:
>> >
>> >> On Friday, December 9, 2016 11:03:29 AM PST Francisco Jerez wrote:
>> >>> Asking the DC for less than one cacheline (4 owords) of data for
>> >>> uniform pull constants is suboptimal because the DC cannot request
>> >>> less than that from L3, resulting in wasted bandwidth and unnecessary
>> >>> message dispatch overhead, and exacerbating the IVB L3 serialization
>> >>> bug.  The following table summarizes the overall framerate improvement
>> >>> (with statistical significance of 5% and sample size ~10) from the
>> >>> whole series up to this patch for several benchmarks and hardware
>> >>> generations:
>> >>> 
>> >>>  | SKL   | BDW  | HSW
>> >>> SynMark2 OglShMapPcf | 24.63% ±0.45% | 4.01% ±0.70% | 10.31% ±0.38%
>> >>> GfxBench4 gl_manhattan31 |  5.93% ±0.35% | 3.92% ±0.31% |  6.62% ±0.22%
>> >>> GfxBench4 gl_4   |  2.52% ±0.44% | 1.23% ±0.10% |  N/A
>> >>> Unigine Valley   |  0.83% ±0.17% | 0.23% ±0.05% |  0.74% ±0.45%
>> >>
>> >> I suspect OglShMapPcf gained SIMD16 on Skylake due to reduced register
>> >> pressure, from the lower message lengths on pull loads.  (At least, it
>> >> did when I had a series to fix that.)  That's probably a large portion
>> >> of the performance improvement here, and why it's so much larger for
>> >> that workload on Skylake specifically.  It might be worth mentioning it
>> >> in your commit message here.
>> >>
>> >
>> > Yeah, that matches my understanding too.  I'll add some shader-db stats
>> > in order to illustrate the effect of this on register pressure, as you
>> > asked me to do in your previous reply.
>> >
>> 
>> FTR, here is a summary of the effect of this series on several shader-db
>> stats.  As you can see the register pressure benefit on SKL+ is
>> substantial:
>> 
>>       Lost->Gained  Total instructions           Total cycles                     Total spills           Total fills
>> BWR:    5 ->   5    4571248 -> 4568342 (-0.06%)  123375740 -> 123373296 (-0.00%)  1488 -> 1488 (0.00%)   1957 -> 1957 (0.00%)
>> ELK:    5 ->   5    3989020 -> 3985402 (-0.09%)   98757068 ->  98754058 (-0.00%)  1489 -> 1489 (0.00%)   1958 -> 1958 (0.00%)
>> ILK:    1 ->   4    6383591 -> 6376787 (-0.11%)  143649910 -> 143648914 (-0.00%)  1449 -> 1449 (0.00%)   1921 -> 1921 (0.00%)
>> SNB:    0 ->   0    7528395 -> 7501446 (-0.36%)  103503796 -> 102460370 (-1.01%)   549 ->  549 (0.00%)     52 ->   52 (0.00%)
>> IVB:   13 ->   3    6949221 -> 6943317 (-0.08%)   60592262 ->  60584422 (-0.01%)  1271 -> 1271 (0.00%)   1162 -> 1162 (0.00%)
>> HSW:   11 ->   0    6409753 -> 6403702 (-0.09%)   60609070 ->  60604414 (-0.01%)  1271 -> 1271 (0.00%)   1162 -> 1162 (0.00%)
>> BDW:   12 ->   0    8043467 -> 7976364 (-0.83%)   68427730 ->  68483042 (0.08%)   1340 -> 1340 (0.00%)   1452 -> 1452 (0.00%)
>> CHV:   12 ->   0    8045019 -> 7977916 (-0.83%)   68297426 ->  68352756 (0.08%)   1340 -> 1340 (0.00%)   1452 -> 1452 (0.00%)
>> SKL:    0 -> 120    8204037 -> 7939086 (-3.23%)   66583900 ->  65624378 (-1.44%)  1269 ->  375 (-70.45%) 1563 ->  690 (-55.85%)
>
> I'm a bit surprised that Gen7-8 lost SIMD16 programs.  Presumably there
> are some cases where we don't need the whole cacheline worth of pulled
> data, and this increased register pressure.  I suppose that could be
> fixed by demoting pull message return length when the last channels
> aren't used.  We might want to do that later on.
>

Yeah, that's one of the two reasons I reworked things somewhat with
respect to the previous version: to give the optimizer control over the
number of OWORDs read by each pull constant load message (the other
reason is to allow us to be more aggressive in the future and request 8
OWORDs at a time per pull constant message).  Though, as you can see,
asking for fewer OWORDs would only help ~10 shaders regain SIMD16 on
Gen7-8, which is tiny compared to the benefit on Gen9+.  On top of that,
the register pressure scheduling heuristic I'm working on regains most
of the SIMD16 shaders lost here (and a bunch more), so it didn't seem
terribly concerning.

> --Ken




Re: [Mesa-dev] [PATCH 6/9] i965/fs: Fetch one cacheline of pull constants at a time.

2016-12-14 Thread Kenneth Graunke
On Wednesday, December 14, 2016 2:18:16 PM PST Francisco Jerez wrote:
> Francisco Jerez  writes:
> 
> > Kenneth Graunke  writes:
> >
> >> On Friday, December 9, 2016 11:03:29 AM PST Francisco Jerez wrote:
> >>> Asking the DC for less than one cacheline (4 owords) of data for
> >>> uniform pull constants is suboptimal because the DC cannot request
> >>> less than that from L3, resulting in wasted bandwidth and unnecessary
> >>> message dispatch overhead, and exacerbating the IVB L3 serialization
> >>> bug.  The following table summarizes the overall framerate improvement
> >>> (with statistical significance of 5% and sample size ~10) from the
> >>> whole series up to this patch for several benchmarks and hardware
> >>> generations:
> >>> 
> >>>  | SKL   | BDW  | HSW
> >>> SynMark2 OglShMapPcf | 24.63% ±0.45% | 4.01% ±0.70% | 10.31% ±0.38%
> >>> GfxBench4 gl_manhattan31 |  5.93% ±0.35% | 3.92% ±0.31% |  6.62% ±0.22%
> >>> GfxBench4 gl_4   |  2.52% ±0.44% | 1.23% ±0.10% |  N/A
> >>> Unigine Valley   |  0.83% ±0.17% | 0.23% ±0.05% |  0.74% ±0.45%
> >>
> >> I suspect OglShMapPcf gained SIMD16 on Skylake due to reduced register
> >> pressure, from the lower message lengths on pull loads.  (At least, it
> >> did when I had a series to fix that.)  That's probably a large portion
> >> of the performance improvement here, and why it's so much larger for
> >> that workload on Skylake specifically.  It might be worth mentioning it
> >> in your commit message here.
> >>
> >
> > Yeah, that matches my understanding too.  I'll add some shader-db stats
> > in order to illustrate the effect of this on register pressure, as you
> > asked me to do in your previous reply.
> >
> 
> FTR, here is a summary of the effect of this series on several shader-db
> stats.  As you can see the register pressure benefit on SKL+ is
> substantial:
> 
>       Lost->Gained  Total instructions           Total cycles                     Total spills           Total fills
> BWR:    5 ->   5    4571248 -> 4568342 (-0.06%)  123375740 -> 123373296 (-0.00%)  1488 -> 1488 (0.00%)   1957 -> 1957 (0.00%)
> ELK:    5 ->   5    3989020 -> 3985402 (-0.09%)   98757068 ->  98754058 (-0.00%)  1489 -> 1489 (0.00%)   1958 -> 1958 (0.00%)
> ILK:    1 ->   4    6383591 -> 6376787 (-0.11%)  143649910 -> 143648914 (-0.00%)  1449 -> 1449 (0.00%)   1921 -> 1921 (0.00%)
> SNB:    0 ->   0    7528395 -> 7501446 (-0.36%)  103503796 -> 102460370 (-1.01%)   549 ->  549 (0.00%)     52 ->   52 (0.00%)
> IVB:   13 ->   3    6949221 -> 6943317 (-0.08%)   60592262 ->  60584422 (-0.01%)  1271 -> 1271 (0.00%)   1162 -> 1162 (0.00%)
> HSW:   11 ->   0    6409753 -> 6403702 (-0.09%)   60609070 ->  60604414 (-0.01%)  1271 -> 1271 (0.00%)   1162 -> 1162 (0.00%)
> BDW:   12 ->   0    8043467 -> 7976364 (-0.83%)   68427730 ->  68483042 (0.08%)   1340 -> 1340 (0.00%)   1452 -> 1452 (0.00%)
> CHV:   12 ->   0    8045019 -> 7977916 (-0.83%)   68297426 ->  68352756 (0.08%)   1340 -> 1340 (0.00%)   1452 -> 1452 (0.00%)
> SKL:    0 -> 120    8204037 -> 7939086 (-3.23%)   66583900 ->  65624378 (-1.44%)  1269 ->  375 (-70.45%) 1563 ->  690 (-55.85%)

I'm a bit surprised that Gen7-8 lost SIMD16 programs.  Presumably there
are some cases where we don't need the whole cacheline worth of pulled
data, and this increased register pressure.  I suppose that could be
fixed by demoting pull message return length when the last channels
aren't used.  We might want to do that later on.
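
(Purely to illustrate the idea, not actual i965 code: if the compiler
knew the last byte of the fetched block that any consumer actually
reads, the message return length could be computed as something like

    #include <cassert>

    /* Owords (16B each) a pull load needs to return when nothing past
     * 'last_used_byte' of the fetched block is read. */
    static unsigned
    pull_load_rlen_owords(unsigned last_used_byte, unsigned block_sz)
    {
       assert(last_used_byte < block_sz);
       return last_used_byte / 16 + 1;
    }

so a load whose consumers only touch bytes 0..19 of the cacheline would
return 2 owords instead of 4, freeing two destination registers.)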

--Ken




Re: [Mesa-dev] [PATCH 6/9] i965/fs: Fetch one cacheline of pull constants at a time.

2016-12-14 Thread Francisco Jerez
Francisco Jerez  writes:

> Kenneth Graunke  writes:
>
>> On Friday, December 9, 2016 11:03:29 AM PST Francisco Jerez wrote:
>>> Asking the DC for less than one cacheline (4 owords) of data for
>>> uniform pull constants is suboptimal because the DC cannot request
>>> less than that from L3, resulting in wasted bandwidth and unnecessary
>>> message dispatch overhead, and exacerbating the IVB L3 serialization
>>> bug.  The following table summarizes the overall framerate improvement
>>> (with statistical significance of 5% and sample size ~10) from the
>>> whole series up to this patch for several benchmarks and hardware
>>> generations:
>>> 
>>>  | SKL   | BDW  | HSW
>>> SynMark2 OglShMapPcf | 24.63% ±0.45% | 4.01% ±0.70% | 10.31% ±0.38%
>>> GfxBench4 gl_manhattan31 |  5.93% ±0.35% | 3.92% ±0.31% |  6.62% ±0.22%
>>> GfxBench4 gl_4   |  2.52% ±0.44% | 1.23% ±0.10% |  N/A
>>> Unigine Valley   |  0.83% ±0.17% | 0.23% ±0.05% |  0.74% ±0.45%
>>
>> I suspect OglShMapPcf gained SIMD16 on Skylake due to reduced register
>> pressure, from the lower message lengths on pull loads.  (At least, it
>> did when I had a series to fix that.)  That's probably a large portion
>> of the performance improvement here, and why it's so much larger for
>> that workload on Skylake specifically.  It might be worth mentioning it
>> in your commit message here.
>>
>
> Yeah, that matches my understanding too.  I'll add some shader-db stats
> in order to illustrate the effect of this on register pressure, as you
> asked me to do in your previous reply.
>

FTR, here is a summary of the effect of this series on several shader-db
stats.  As you can see the register pressure benefit on SKL+ is
substantial:

      Lost->Gained  Total instructions           Total cycles                     Total spills           Total fills
BWR:    5 ->   5    4571248 -> 4568342 (-0.06%)  123375740 -> 123373296 (-0.00%)  1488 -> 1488 (0.00%)   1957 -> 1957 (0.00%)
ELK:    5 ->   5    3989020 -> 3985402 (-0.09%)   98757068 ->  98754058 (-0.00%)  1489 -> 1489 (0.00%)   1958 -> 1958 (0.00%)
ILK:    1 ->   4    6383591 -> 6376787 (-0.11%)  143649910 -> 143648914 (-0.00%)  1449 -> 1449 (0.00%)   1921 -> 1921 (0.00%)
SNB:    0 ->   0    7528395 -> 7501446 (-0.36%)  103503796 -> 102460370 (-1.01%)   549 ->  549 (0.00%)     52 ->   52 (0.00%)
IVB:   13 ->   3    6949221 -> 6943317 (-0.08%)   60592262 ->  60584422 (-0.01%)  1271 -> 1271 (0.00%)   1162 -> 1162 (0.00%)
HSW:   11 ->   0    6409753 -> 6403702 (-0.09%)   60609070 ->  60604414 (-0.01%)  1271 -> 1271 (0.00%)   1162 -> 1162 (0.00%)
BDW:   12 ->   0    8043467 -> 7976364 (-0.83%)   68427730 ->  68483042 (0.08%)   1340 -> 1340 (0.00%)   1452 -> 1452 (0.00%)
CHV:   12 ->   0    8045019 -> 7977916 (-0.83%)   68297426 ->  68352756 (0.08%)   1340 -> 1340 (0.00%)   1452 -> 1452 (0.00%)
SKL:    0 -> 120    8204037 -> 7939086 (-3.23%)   66583900 ->  65624378 (-1.44%)  1269 ->  375 (-70.45%) 1563 ->  690 (-55.85%)

>> Thanks for all your work on this.  I was originally concerned about the
>> Ivybridge bug, but given that we're loading one cacheline at a time, it
>> seems very unlikely that we'd ever load the same cacheline twice within
>> 16 cycles.  We could if we have IF (non-uniform) load[foo] ELSE load[bar]
>> where foo and bar are indirect expressions that happen to be equal.  But
>> that seems quite uncommon.
>>
>> Series is:
>> Reviewed-by: Kenneth Graunke 
>
> Thanks!




Re: [Mesa-dev] [PATCH 6/9] i965/fs: Fetch one cacheline of pull constants at a time.

2016-12-13 Thread Francisco Jerez
Kenneth Graunke  writes:

> On Friday, December 9, 2016 11:03:29 AM PST Francisco Jerez wrote:
>> Asking the DC for less than one cacheline (4 owords) of data for
>> uniform pull constants is suboptimal because the DC cannot request
>> less than that from L3, resulting in wasted bandwidth and unnecessary
>> message dispatch overhead, and exacerbating the IVB L3 serialization
>> bug.  The following table summarizes the overall framerate improvement
>> (with statistical significance of 5% and sample size ~10) from the
>> whole series up to this patch for several benchmarks and hardware
>> generations:
>> 
>>  | SKL   | BDW  | HSW
>> SynMark2 OglShMapPcf | 24.63% ±0.45% | 4.01% ±0.70% | 10.31% ±0.38%
>> GfxBench4 gl_manhattan31 |  5.93% ±0.35% | 3.92% ±0.31% |  6.62% ±0.22%
>> GfxBench4 gl_4   |  2.52% ±0.44% | 1.23% ±0.10% |  N/A
>> Unigine Valley   |  0.83% ±0.17% | 0.23% ±0.05% |  0.74% ±0.45%
>
> I suspect OglShMapPcf gained SIMD16 on Skylake due to reduced register
> pressure, from the lower message lengths on pull loads.  (At least, it
> did when I had a series to fix that.)  That's probably a large portion
> of the performance improvement here, and why it's so much larger for
> that workload on Skylake specifically.  It might be worth mentioning it
> in your commit message here.
>

Yeah, that matches my understanding too.  I'll add some shader-db stats
in order to illustrate the effect of this on register pressure, as you
asked me to do in your previous reply.

> Thanks for all your work on this.  I was originally concerned about the
> Ivybridge bug, but given that we're loading one cacheline at a time, it
> seems very unlikely that we'd ever load the same cacheline twice within
> 16 cycles.  We could if we have IF (non-uniform) load[foo] ELSE load[bar]
> where foo and bar are indirect expressions that happen to be equal.  But
> that seems quite uncommon.
>
> Series is:
> Reviewed-by: Kenneth Graunke 

Thanks!




Re: [Mesa-dev] [PATCH 6/9] i965/fs: Fetch one cacheline of pull constants at a time.

2016-12-13 Thread Kenneth Graunke
On Friday, December 9, 2016 11:03:29 AM PST Francisco Jerez wrote:
> Asking the DC for less than one cacheline (4 owords) of data for
> uniform pull constants is suboptimal because the DC cannot request
> less than that from L3, resulting in wasted bandwidth and unnecessary
> message dispatch overhead, and exacerbating the IVB L3 serialization
> bug.  The following table summarizes the overall framerate improvement
> (with statistical significance of 5% and sample size ~10) from the
> whole series up to this patch for several benchmarks and hardware
> generations:
> 
>  | SKL   | BDW  | HSW
> SynMark2 OglShMapPcf | 24.63% ±0.45% | 4.01% ±0.70% | 10.31% ±0.38%
> GfxBench4 gl_manhattan31 |  5.93% ±0.35% | 3.92% ±0.31% |  6.62% ±0.22%
> GfxBench4 gl_4   |  2.52% ±0.44% | 1.23% ±0.10% |  N/A
> Unigine Valley   |  0.83% ±0.17% | 0.23% ±0.05% |  0.74% ±0.45%

I suspect OglShMapPcf gained SIMD16 on Skylake due to reduced register
pressure, from the lower message lengths on pull loads.  (At least, it
did when I had a series to fix that.)  That's probably a large portion
of the performance improvement here, and why it's so much larger for
that workload on Skylake specifically.  It might be worth mentioning it
in your commit message here.

Thanks for all your work on this.  I was originally concerned about the
Ivybridge bug, but given that we're loading one cacheline at a time, it
seems very unlikely that we'd ever load the same cacheline twice within
16 cycles.  We could if we have IF (non-uniform) load[foo] ELSE load[bar]
where foo and bar are indirect expressions that happen to be equal.  But
that seems quite uncommon.
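
(Concretely, assuming 64B cachelines, the two loads can only alias when
their dynamically computed byte offsets satisfy a check along the lines
of

    /* True iff two pull-constant offsets land in the same 64B cacheline,
     * the only remaining way to retrigger the IVB L3 serialization bug
     * once every message fetches a whole cacheline. */
    static bool
    same_cacheline(unsigned foo_offset, unsigned bar_offset)
    {
       return (foo_offset & ~63u) == (bar_offset & ~63u);
    }

which should indeed be rare for two unrelated indirect expressions.)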

Series is:
Reviewed-by: Kenneth Graunke 




[Mesa-dev] [PATCH 6/9] i965/fs: Fetch one cacheline of pull constants at a time.

2016-12-09 Thread Francisco Jerez
Asking the DC for less than one cacheline (4 owords) of data for
uniform pull constants is suboptimal because the DC cannot request
less than that from L3, resulting in wasted bandwidth and unnecessary
message dispatch overhead, and exacerbating the IVB L3 serialization
bug.  The following table summarizes the overall framerate improvement
(with statistical significance of 5% and sample size ~10) from the
whole series up to this patch for several benchmarks and hardware
generations:

 | SKL   | BDW  | HSW
SynMark2 OglShMapPcf | 24.63% ±0.45% | 4.01% ±0.70% | 10.31% ±0.38%
GfxBench4 gl_manhattan31 |  5.93% ±0.35% | 3.92% ±0.31% |  6.62% ±0.22%
GfxBench4 gl_4   |  2.52% ±0.44% | 1.23% ±0.10% |  N/A
Unigine Valley   |  0.83% ±0.17% | 0.23% ±0.05% |  0.74% ±0.45%

Note that GfxBench4 ships two versions of the Manhattan demo: the
original gl_manhattan demo, which doesn't use UBOs and is therefore
unaffected by this patch, and the gl_manhattan31 demo based on
GL 4.3/GLES 3.1, which benefits as shown above.

I haven't observed any statistically significant regressions in the
benchmarks I have at hand.

Going up to 8 oword blocks would improve performance of pull constants
even more, but at the cost of some additional bandwidth and register
pressure, so it would have to be done on-demand based on the number of
constants actually used by the shader.
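
As a rough sketch of what such an on-demand choice could look like
(purely hypothetical, not implemented by this patch; 'bytes_used' stands
for the extent of the constants the shader actually reads past the block
base):

   /* Fetch two cachelines (8 owords) only when the shader reads enough
    * constants past the block base to amortize the extra bandwidth and
    * register pressure, otherwise stick to one cacheline (4 owords). */
   static unsigned
   choose_pull_block_size(unsigned bytes_used)
   {
      return bytes_used > 64 ? 128 : 64;
   }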

v2: Fix for Gen4 and 5.
v3: Non-trivial rebase.  Rework to allow the visitor to specify
    arbitrary pull constant block sizes.
---
 src/mesa/drivers/dri/i965/brw_fs.cpp     | 21 +++++++++------------
 src/mesa/drivers/dri/i965/brw_fs_nir.cpp | 16 +++++++++-------
 2 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
index b6a571a..0221287 100644
--- a/src/mesa/drivers/dri/i965/brw_fs.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
@@ -2111,25 +2111,22 @@ fs_visitor::lower_constant_loads()
          if (pull_index == -1)
            continue;
 
-         const unsigned index = stage_prog_data->binding_table.pull_constants_start;
-         fs_reg dst;
-
-         if (type_sz(inst->src[i].type) <= 4)
-            dst = vgrf(glsl_type::float_type);
-         else
-            dst = vgrf(glsl_type::double_type);
-
          assert(inst->src[i].stride == 0);
 
-         const fs_builder ubld = ibld.exec_all().group(4, 0);
-         struct brw_reg offset = brw_imm_ud((unsigned)(pull_index * 4) & ~15);
+         const unsigned index = stage_prog_data->binding_table.pull_constants_start;
+         const unsigned block_sz = 64; /* Fetch one cacheline at a time. */
+         const fs_builder ubld = ibld.exec_all().group(block_sz / 4, 0);
+         const fs_reg dst = ubld.vgrf(BRW_REGISTER_TYPE_UD);
+         const unsigned base = pull_index * 4;
+
          ubld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD,
-                   dst, brw_imm_ud(index), offset);
+                   dst, brw_imm_ud(index), brw_imm_ud(base & ~(block_sz - 1)));
 
          /* Rewrite the instruction to use the temporary VGRF. */
          inst->src[i].file = VGRF;
          inst->src[i].nr = dst.nr;
-         inst->src[i].offset = (pull_index & 3) * 4 + inst->src[i].offset % 4;
+         inst->src[i].offset = (base & (block_sz - 1)) +
+                               inst->src[i].offset % 4;
 
          brw_mark_surface_used(prog_data, index);
       }
diff --git a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
index 7e00086..e97cae3 100644
--- a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
@@ -4059,21 +4059,23 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
           * and we have to split it if necessary.
           */
          const unsigned type_size = type_sz(dest.type);
-         const fs_builder ubld = bld.exec_all().group(4, 0);
-         const fs_reg packed_consts = ubld.vgrf(BRW_REGISTER_TYPE_F);
+         const unsigned block_sz = 64; /* Fetch one cacheline at a time. */
+         const fs_builder ubld = bld.exec_all().group(block_sz / 4, 0);
+         const fs_reg packed_consts = ubld.vgrf(BRW_REGISTER_TYPE_UD);
 
          for (unsigned c = 0; c < instr->num_components;) {
             const unsigned base = const_offset->u32[0] + c * type_size;
-
-            /* Number of usable components in the next 16B-aligned load */
+            /* Number of usable components in the next block-aligned load. */
             const unsigned count = MIN2(instr->num_components - c,
-                                        (16 - base % 16) / type_size);
+                                        (block_sz - base % block_sz) / type_size);
 
             ubld.emit(FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD,
-                      packed_consts, surf_index, brw_imm_ud(base & ~15));
+                      packed_consts, surf_index,
+                      brw_imm_ud(base & ~(block_sz - 1)));
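
For reference, a standalone example of the offset arithmetic both hunks
above rely on (the 64B block size and 4 bytes per uniform slot mirror
the patch; the pull_index value is made up):

   #include <cassert>

   int main()
   {
      const unsigned block_sz = 64;          /* one cacheline = 4 owords */
      const unsigned pull_index = 21;        /* e.g. the 22nd 32-bit uniform */
      const unsigned base = pull_index * 4;  /* byte offset 84 */

      /* The load message fetches the whole cacheline containing 'base'... */
      const unsigned block_base = base & ~(block_sz - 1);  /* 64 */
      /* ...and the consumer reads at the residual offset within it. */
      const unsigned intra = base & (block_sz - 1);        /* 20 */

      assert(block_base + intra == base);
      return 0;
   }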