Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders

2018-06-19 Thread Matt Turner
On Thu, May 31, 2018 at 10:39 AM Matt Turner  wrote:
>
> On Fri, May 25, 2018 at 3:28 PM, Matt Turner  wrote:
> >> 1-6, 8-20 are
> >>
> >> Reviewed-by: Matt Turner 
> >
> > 7, 22-31 are too.
>
> 34-49 are too.

21 landed separately. 32, 33, 51-53 are also

Reviewed-by: Matt Turner 

so I think you're just missing a R-b on 50.
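
For anyone tallying the tags, the bookkeeping above can be checked mechanically. The following is only a sketch in Python and not part of the series; expand() and missing_review() are names made up for illustration, and the ranges are copied from the review mails in this thread.

```python
# The series has 53 patches ([PATCH 00/53] is the cover letter).
TOTAL_PATCHES = 53


def expand(ranges):
    """Expand inclusive (first, last) pairs into a set of patch numbers."""
    covered = set()
    for first, last in ranges:
        covered.update(range(first, last + 1))
    return covered


# Ranges that carry a Reviewed-by according to the mails in this thread.
reviewed = expand([(1, 6), (8, 20), (7, 7), (22, 31), (34, 49),
                   (32, 33), (51, 53)])
# Patch 21 landed separately, so it no longer needs a review here.
landed = {21}


def missing_review(total, reviewed, landed):
    """Return the sorted patch numbers that are neither reviewed nor landed."""
    return sorted(set(range(1, total + 1)) - reviewed - landed)


print(missing_review(TOTAL_PATCHES, reviewed, landed))  # prints [50]
```

The only number left over is 50, which matches the conclusion above.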
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders

2018-06-01 Thread Eero Tamminen

Hi,

On 30.05.2018 17:30, Jason Ekstrand wrote:

On May 30, 2018 06:45:29 Eero Tamminen  wrote:

On 29.05.2018 18:58, Eero Tamminen wrote:

On 25.05.2018 00:55, Jason Ekstrand wrote:

This patch series adds back-end compiler support for SIMD32 fragment
shaders.  Support is added and everything works but it's currently hidden
behind INTEL_DEBUG=do32.  We know that it improves performance in some
cases but we do not yet have a good enough heuristic to start turning it on
by default.  The objective of this series is just to get the compiler
infrastructure landed so that it stops bit-rotting in Curro's branch.


Tested v3 on BXT & SKL.  Everything seems to work fine.


Everything works fine also on GEN8 (BSW & BDW GT2), but half the tests
invoke GPU hangs on GEN7 (BYT & HSW GT2).


That problem is known.  It's caused by using SIMD32 shaders for fast 
clears.  The SIMD32 replicated clear shaders were added on at the last 
minute and didn't get good enough testing before sending out the series. 
We can either drop those two patches and modify the last one to not do 
SIMD32 when use_replicated_clear is set or I have another patch which 
just disables SIMD32 for fast clears.


AFAIK plain copy & write shaders (like clear) don't benefit from SIMD32.
They are already 100% bottlenecked by input/output bandwidth with
SIMD16, so the instruction scheduling latency improvement can't help.  At
worst SIMD32 can make them slower, if it causes extra cache thrashing.



- Eero


One option would be to support SIMD32 just for GEN8+.



Tested-by Eero Tamminen 



Figuring out a good heuristic is left as an exercise to the reader. :-)


A simple heuristic that just enables SIMD32 for everything that isn't an
MRT shader gives nice perf improvements on BXT J4205:
* +30% GfxBench ALU2
* +25% SynMark PSPom
* +10% GpuTest Julia32
* +9% GfxBench CarChase
* +7% GfxBench Manhattan 3.0
* +3-7% GLB T-Rex, SynMark ShMapVsm, GpuTest Triangle
* +1-3% GfxBench Manhattan 3.1 & T-Rex, Unigine Heaven, GpuTest FurMark
* -1-2% GfxBench Aztec Ruins, MemBW Write, SynMark DeferredAA, Fill*,
VSInstancing & ZBuffer
* -2-3% GLB 2.7 Fill
* -4-5% MemBW Blend

On SKL, perf differences are smaller.


On GEN8, the improvements are smaller and regressions larger with
the same heuristic.

The main difference with the 12EU single-channel BSW is a -15% regression
in perf of SynMark FillTexMulti, i.e. sampling 8 textures and writing
out their average value.  With single-channel memory, increased memory
latency causes a lot more thrashing with SIMD32 when many textures are
being sampled close together.



SIMD32 can cause write-bound tests to thrash, which is visible as a perf
regression in the fully write-bound tests above (that's also the reason
why SIMD32 is good to disable with MRT shaders).

As to reads, SIMD32 improves cache locality until it starts thrashing.
In the GfxBench tests above, with the amount of texture sampling they do,
this shows in HW counters as increased texture cache misses (thrashing),
but fewer L3 misses (better locality).  Along with (more important) better
latency compensation, these explain why SIMD32 improves performance in
them.


More advanced heuristics that try to avoid the SIMD32 performance
regressions unfortunately also get rid of a clear part of the above
improvements.  Such heuristics would need an improved instruction scheduler


Heuristics for things affecting texture fetch latencies would help, like
how many fetches there are, to how many different textures, and how close
together they are vs. how large the caches are and how fast the RAM is.


- Eero


that provides feedback on which shaders have latency issues where SIMD32
would help.

(A potential run-time heuristic would be disabling SIMD32 when too
large textures are bound for a draw.)


   - Eero



Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders

2018-05-31 Thread Matt Turner
On Fri, May 25, 2018 at 3:28 PM, Matt Turner  wrote:
>> 1-6, 8-20 are
>>
>> Reviewed-by: Matt Turner 
>
> 7, 22-31 are too.

34-49 are too.


Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders

2018-05-30 Thread Jason Ekstrand

On May 30, 2018 06:45:29 Eero Tamminen  wrote:


Hi,

On 29.05.2018 18:58, Eero Tamminen wrote:

On 25.05.2018 00:55, Jason Ekstrand wrote:

This patch series adds back-end compiler support for SIMD32 fragment
shaders.  Support is added and everything works but it's currently hidden
behind INTEL_DEBUG=do32.  We know that it improves performance in some
cases but we do not yet have a good enough heuristic to start turning it on
by default.  The objective of this series is just to get the compiler
infrastructure landed so that it stops bit-rotting in Curro's branch.


Tested v3 on BXT & SKL.  Everything seems to work fine.


Everything works fine also on GEN8 (BSW & BDW GT2), but half the tests
invoke GPU hangs on GEN7 (BYT & HSW GT2).


That problem is known.  It's caused by using SIMD32 shaders for fast 
clears.  The SIMD32 replicated clear shaders were added on at the last 
minute and didn't get good enough testing before sending out the series.  
We can either drop those two patches and modify the last one to not do 
SIMD32 when use_replicated_clear is set or I have another patch which just 
disables SIMD32 for fast clears.
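
A rough sketch of the second option (disabling SIMD32 for fast clears), written in Python for illustration only. The helper names debug_flag_set() and simd32_allowed() are hypothetical, and treating INTEL_DEBUG as a comma-separated list of flags is an assumption rather than something stated in this thread; use_replicated_clear is the only name taken from the mail above.

```python
import os


def debug_flag_set(value, flag):
    """Return True if `flag` appears in a comma-separated flag string."""
    return flag in (item.strip() for item in value.split(","))


def simd32_allowed(intel_debug, use_replicated_clear):
    """SIMD32 stays opt-in via do32 and is never used for replicated clears."""
    if use_replicated_clear:
        # Fast clears are where the GEN7 hangs were seen, so skip SIMD32 there.
        return False
    return debug_flag_set(intel_debug, "do32")


if __name__ == "__main__":
    # Assumption: the flag string comes from the INTEL_DEBUG environment variable.
    env = os.environ.get("INTEL_DEBUG", "")
    print(simd32_allowed(env, use_replicated_clear=False))
```

With this shape the clear path is excluded regardless of the debug flag, which is the behaviour the extra patch is described as providing.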




One option would be to support SIMD32 just for GEN8+.



Tested-by Eero Tamminen 



Figuring out a good heuristic is left as an exercise to the reader. :-)


A simple heuristic that just enables SIMD32 for everything that isn't an
MRT shader gives nice perf improvements on BXT J4205:
* +30% GfxBench ALU2
* +25% SynMark PSPom
* +10% GpuTest Julia32
* +9% GfxBench CarChase
* +7% GfxBench Manhattan 3.0
* +3-7% GLB T-Rex, SynMark ShMapVsm, GpuTest Triangle
* +1-3% GfxBench Manhattan 3.1 & T-Rex, Unigine Heaven, GpuTest FurMark
* -1-2% GfxBench Aztec Ruins, MemBW Write, SynMark DeferredAA, Fill*,
VSInstancing & ZBuffer
* -2-3% GLB 2.7 Fill
* -4-5% MemBW Blend

On SKL, perf differences are smaller.


On GEN8, the improvements are smaller and regressions larger with
the same heuristic.

The main difference with the 12EU single-channel BSW is a -15% regression
in perf of SynMark FillTexMulti, i.e. sampling 8 textures and writing
out their average value.  With single-channel memory, increased memory
latency causes a lot more thrashing with SIMD32 when many textures are
being sampled close together.



SIMD32 can cause write-bound tests to thrash, which is visible as a perf
regression in the fully write-bound tests above (that's also the reason
why SIMD32 is good to disable with MRT shaders).

As to reads, SIMD32 improves cache locality until it starts thrashing.
In the GfxBench tests above, with the amount of texture sampling they do,
this shows in HW counters as increased texture cache misses (thrashing),
but fewer L3 misses (better locality).  Along with (more important) better
latency compensation, these explain why SIMD32 improves performance in
them.


More advanced heuristics that try to avoid the SIMD32 performance
regressions unfortunately also get rid of a clear part of the above
improvements.  Such heuristics would need an improved instruction scheduler


Heuristics for things affecting texture fetch latencies would help, like
how many fetches there are, to how many different textures, and how close
together they are vs. how large the caches are and how fast the RAM is.


- Eero


that provides feedback on which shaders have latency issues where SIMD32
would help.

(A potential run-time heuristic would be disabling SIMD32 when too
large textures are bound for a draw.)


   - Eero



Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders

2018-05-30 Thread Eero Tamminen

Hi,

On 29.05.2018 18:58, Eero Tamminen wrote:

On 25.05.2018 00:55, Jason Ekstrand wrote:

This patch series adds back-end compiler support for SIMD32 fragment
shaders.  Support is added and everything works but it's currently hidden
behind INTEL_DEBUG=do32.  We know that it improves performance in some
cases but we do not yet have a good enough heuristic to start turning it on
by default.  The objective of this series is just to get the compiler
infrastructure landed so that it stops bit-rotting in Curro's branch.


Tested v3 on BXT & SKL.  Everything seems to work fine.


Everything works fine also on GEN8 (BSW & BDW GT2), but half the tests
invoke GPU hangs on GEN7 (BYT & HSW GT2).

One option would be to support SIMD32 just for GEN8+.



Tested-by Eero Tamminen 



Figuring out a good heuristic is left as an exercise to the reader. :-)


A simple heuristic that just enables SIMD32 for everything that isn't an
MRT shader gives nice perf improvements on BXT J4205:
* +30% GfxBench ALU2
* +25% SynMark PSPom
* +10% GpuTest Julia32
* +9% GfxBench CarChase
* +7% GfxBench Manhattan 3.0
* +3-7% GLB T-Rex, SynMark ShMapVsm, GpuTest Triangle
* +1-3% GfxBench Manhattan 3.1 & T-Rex, Unigine Heaven, GpuTest FurMark
* -1-2% GfxBench Aztec Ruins, MemBW Write, SynMark DeferredAA, Fill*,
  VSInstancing & ZBuffer
* -2-3% GLB 2.7 Fill
* -4-5% MemBW Blend


On SKL, perf differences are smaller.


On GEN8, the improvements are smaller and regressions larger with
the same heuristic.

The main difference with the 12EU single-channel BSW is a -15% regression
in perf of SynMark FillTexMulti, i.e. sampling 8 textures and writing
out their average value.  With single-channel memory, increased memory
latency causes a lot more thrashing with SIMD32 when many textures are
being sampled close together.



SIMD32 can cause write-bound tests to thrash, which is visible as a perf
regression in the fully write-bound tests above (that's also the reason
why SIMD32 is good to disable with MRT shaders).

As to reads, SIMD32 improves cache locality until it starts thrashing.
In the GfxBench tests above, with the amount of texture sampling they do,
this shows in HW counters as increased texture cache misses (thrashing),
but fewer L3 misses (better locality).  Along with (more important) better
latency compensation, these explain why SIMD32 improves performance in
them.


More advanced heuristics that try to avoid the SIMD32 performance
regressions unfortunately also get rid of a clear part of the above
improvements.  Such heuristics would need an improved instruction scheduler


Heuristics for things affecting texture fetch latencies would help, like
how many fetches there are, to how many different textures, and how close
together they are vs. how large the caches are and how fast the RAM is.


- Eero


that provides feedback on which shaders have latency issues where SIMD32
would help.

(A potential run-time heuristic would be disabling SIMD32 when too
large textures are bound for a draw.)


 - Eero



Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders

2018-05-29 Thread Eero Tamminen

Hi,

On 29.05.2018 18:58, Eero Tamminen wrote:

On 25.05.2018 00:55, Jason Ekstrand wrote:

This patch series adds back-end compiler support for SIMD32 fragment
shaders.  Support is added and everything works but it's currently hidden
behind INTEL_DEBUG=do32.  We know that it improves performance in some
cases but we do not yet have a good enough heuristic to start turning it on
by default.  The objective of this series is just to get the compiler
infrastructure landed so that it stops bit-rotting in Curro's branch.


Tested v3 on BXT & SKL.  Everything seems to work otherwise fine.


s/otherwise//


- Eero

(regardless of how many times one reads a mail before sending, there
always seems to be some leftover one misses.)


Tested-by Eero Tamminen 



Figuring out a good heuristic is left as an exercise to the reader. :-)


A simple heuristic that just enables SIMD32 for everything that isn't an
MRT shader gives nice perf improvements on BXT J4205:
* +30% GfxBench ALU2
* +25% SynMark PSPom
* +10% GpuTest Julia32
* +9% GfxBench CarChase
* +7% GfxBench Manhattan 3.0
* +3-7% GLB T-Rex, SynMark ShMapVsm, GpuTest Triangle
* +1-3% GfxBench Manhattan 3.1 & T-Rex, Unigine Heaven, GpuTest FurMark
* -1-2% GfxBench Aztec Ruins, MemBW Write, SynMark DeferredAA, Fill*,
  VSInstancing & ZBuffer
* -2-3% GLB 2.7 Fill
* -4-5% MemBW Blend

On SKL, perf differences are smaller.

SIMD32 can cause write-bound tests to thrash, which is visible as a perf
regression in the fully write-bound tests above (that's also the reason
why SIMD32 is good to disable with MRT shaders).

As to reads, SIMD32 improves cache locality until it starts thrashing.
In the GfxBench tests above, with the amount of texture sampling they do,
this shows in HW counters as increased texture cache misses (thrashing),
but fewer L3 misses (better locality).  Along with (more important) better
latency compensation, these explain why SIMD32 improves performance in
them.


More advanced heuristics that try to avoid the SIMD32 performance
regressions unfortunately also get rid of a clear part of the above
improvements.  Such heuristics would need an improved instruction scheduler
that provides feedback on which shaders have latency issues where SIMD32
would help.

(A potential run-time heuristic would be disabling SIMD32 when too
large textures are bound for a draw.)


 - Eero



Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders

2018-05-29 Thread Eero Tamminen

Hi,

On 25.05.2018 00:55, Jason Ekstrand wrote:

This patch series adds back-end compiler support for SIMD32 fragment
shaders.  Support is added and everything works but it's currently hidden
behind INTEL_DEBUG=do32.  We know that it improves performance in some
cases but we do not yet have a good enough heuristic to start turning it on
by default.  The objective of this series is just to get the compiler
infrastructure landed so that it stops bit-rotting in Curro's branch.


Tested v3 on BXT & SKL.  Everything seems to work otherwise fine.

Tested-by Eero Tamminen 



Figuring out a good heuristic is left as an exercise to the reader. :-)


A simple heuristic that just enables SIMD32 for everything that isn't an
MRT shader gives nice perf improvements on BXT J4205:
* +30% GfxBench ALU2
* +25% SynMark PSPom
* +10% GpuTest Julia32
* +9% GfxBench CarChase
* +7% GfxBench Manhattan 3.0
* +3-7% GLB T-Rex, SynMark ShMapVsm, GpuTest Triangle
* +1-3% GfxBench Manhattan 3.1 & T-Rex, Unigine Heaven, GpuTest FurMark
* -1-2% GfxBench Aztec Ruins, MemBW Write, SynMark DeferredAA, Fill*,
  VSInstancing & ZBuffer
* -2-3% GLB 2.7 Fill
* -4-5% MemBW Blend

On SKL, perf differences are smaller.

SIMD32 can cause write-bound tests to thrash, which is visible as a perf
regression in the fully write-bound tests above (that's also the reason
why SIMD32 is good to disable with MRT shaders).
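
As an illustration only, the simple heuristic described above could be sketched as follows. This is Python rather than driver code, and FragmentShaderInfo, num_render_targets, uses_discard and want_simd32() are hypothetical names, not Mesa structures or functions. The discard check is an addition that mirrors the "Disable SIMD32 dispatch for fragment shaders with discard" patch title from the series.

```python
from dataclasses import dataclass


@dataclass
class FragmentShaderInfo:
    """Hypothetical per-shader summary; not a Mesa data structure."""
    num_render_targets: int      # number of color outputs the shader writes
    uses_discard: bool = False   # the series disables SIMD32 with discard


def want_simd32(info):
    """Enable SIMD32 for everything that is not an MRT shader."""
    if info.uses_discard:
        return False
    # MRT shaders are the write-heavy case singled out above, so they are
    # kept at the narrower dispatch widths.
    return info.num_render_targets <= 1


print(want_simd32(FragmentShaderInfo(num_render_targets=1)))  # prints True
print(want_simd32(FragmentShaderInfo(num_render_targets=4)))  # prints False
```

A single boolean keeps the decision per shader and needs no run-time state, which is also why it cannot tell a texture-heavy shader from an ALU-heavy one.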

As to reads, SIMD32 improves cache locality until it starts thrashing.
In the GfxBench tests above, with the amount of texture sampling they do,
this shows in HW counters as increased texture cache misses (thrashing),
but fewer L3 misses (better locality).  Along with (more important) better
latency compensation, these explain why SIMD32 improves performance in
them.


More advanced heuristics that try to avoid the SIMD32 performance
regressions unfortunately also get rid of a clear part of the above
improvements.  Such heuristics would need an improved instruction scheduler
that provides feedback on which shaders have latency issues where SIMD32
would help.

(A potential run-time heuristic would be disabling SIMD32 when too
large textures are bound for a draw.)


- Eero



Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders

2018-05-25 Thread Matt Turner
On Fri, May 25, 2018 at 11:50 AM, Matt Turner  wrote:
> On Thu, May 24, 2018 at 2:55 PM, Jason Ekstrand  wrote:
>> This patch series adds back-end compiler support for SIMD32 fragment
>> shaders.  Support is added and everything works but it's currently hidden
>> behind INTEL_DEBUG=do32.  We know that it improves performance in some
>> cases but we do not yet have a good enough heuristic to start turning it on
>> by default.  The objective of this series is just to get the compiler
>> infrastructure landed so that it stops bit-rotting in Curro's branch.
>> Figuring out a good heuristic is left as an exercise to the reader. :-)
>
> 1-6, 8-20 are
>
> Reviewed-by: Matt Turner 

7, 22-31 are too.


Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders

2018-05-25 Thread Matt Turner
On Thu, May 24, 2018 at 2:55 PM, Jason Ekstrand  wrote:
> This patch series adds back-end compiler support for SIMD32 fragment
> shaders.  Support is added and everything works but it's currently hidden
> behind INTEL_DEBUG=do32.  We know that it improves performance in some
> cases but we do not yet have a good enough heuristic to start turning it on
> by default.  The objective of this series is just to get the compiler
> infrastructure landed so that it stops bit-rotting in Curro's branch.
> Figuring out a good heuristic is left as an exercise to the reader. :-)

1-6, 8-20 are

Reviewed-by: Matt Turner 


[Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders

2018-05-24 Thread Jason Ekstrand
This patch series adds back-end compiler support for SIMD32 fragment
shaders.  Support is added and everything works but it's currently hidden
behind INTEL_DEBUG=do32.  We know that it improves performance in some
cases but we do not yet have a good enough heuristic to start turning it on
by default.  The objective of this series is just to get the compiler
infrastructure landed so that it stops bit-rotting in Curro's branch.
Figuring out a good heuristic is left as an exercise to the reader. :-)

Francisco Jerez (34):
  intel/eu: Remove brw_codegen::compressed_stack.
  intel/fs: Rename a local variable so it doesn't shadow component()
  intel/fs: Use the ATTR file for FS inputs
  intel/fs: Replace the CINTERP opcode with a simple MOV
  intel/fs: Add explicit last_rt flag to fb writes orthogonal to eot.
  intel/fs: Fix Gen4-5 FB write AA data payload munging for non-EOT
    writes.
  intel/eu: Return new instruction to caller from brw_fb_WRITE().
  intel/fs: Fix fs_inst::flags_written() for Gen4-5 FB writes.
  intel/fs: Fix implied_mrf_writes() for headerless FB writes.
  intel/fs: Remove program key argument from generator.
  intel/fs: Disable SIMD32 dispatch on Gen4-6 with control flow
  intel/fs: Disable SIMD32 dispatch for fragment shaders with discard.
  intel/eu: Fix pixel interpolator queries for SIMD32.
  intel/fs: Fix codegen of FS_OPCODE_SET_SAMPLE_ID for SIMD32.
  intel/fs: Don't enable dual source blend if no outputs are written
  intel/fs: Fix FB write message control codegen for SIMD32.
  intel/fs: Fix logical FB write lowering for SIMD32
  intel/fs: Fix FB read header setup for SIMD32.
  intel/fs: Rework INTERPOLATE_AT_PER_SLOT_OFFSET
  intel/fs: Mark LINTERP opcode as writing accumulator implicitly on
    pre-Gen7.
  intel/fs: Disable opt_sampler_eot() in 32-wide dispatch.
  i965: Add plumbing for shader time in 32-wide FS dispatch mode.
  intel/fs: Simplify fs_visitor::emit_samplepos_setup
  intel/fs: Use fs_regs instead of brw_regs in the unlit centroid
    workaround
  intel/fs: Wrap FS payload register look-up in a helper function.
  intel/fs: Extend thread payload layout to SIMD32
  intel/fs: Implement 32-wide FS payload setup on Gen6+
  intel/fs: Fix Gen7 compressed source region alignment restriction for
    SIMD32
  intel/fs: Fix sample id setup for SIMD32.
  intel/fs: Generalize the unlit centroid workaround
  intel/fs: Fix Gen6+ interpolation setup for SIMD32
  intel/fs: Fix fs_builder::sample_mask_reg() for 32-wide FS dispatch.
  intel/fs: Fix nir_intrinsic_load_helper_invocation for SIMD32.
  intel/fs: Build 32-wide FS shaders.

Jason Ekstrand (19):
  intel/fs: Assert that the gen4-6 plane restrictions are followed
  intel/fs: Use groups for SIMD16 LINTERP on gen11+
  intel/fs: FS_OPCODE_REP_FB_WRITE has side effects
  intel/fs: Properly track implied header regs read by FB writes
  intel/fs: Pull FB write implied headers from src[0]
  intel/fs: Set up FB write message headers in the visitor
  i965: Re-arrange shader kernel setup in WM state
  intel/compiler: Add and use helpers for working with KSP indices
  intel/fs: Rework KSP data to be SIMD width-based
  intel/fs: Split instructions low to high in lower_simd_width
  intel/fs: Properly copy default flag reg for 3src instrucitons
  intel/fs: Add the group to the flag subreg number on SNB and older
  intel/fs: Emit LINE+MAC for LINTERP with unaligned coordinates
  intel/fs: Emit MOV_DISPATCH_TO_FLAGS once for the centroid workaround
  intel/fs: Get rid of MOV_DISPATCH_TO_FLAGS
  intel/fs: Add fields to wm_prog_data for SIMD32 dispatch
  intel/anv,blorp,i965: Implement the SKL 16x MSAA SIMD32 workaround
  intel/fs: Remove support push constants in repclear shaders
  intel/fs: Support SIMD32 repclear shaders

 src/intel/blorp/blorp.c                    |   2 +-
 src/intel/blorp/blorp_genX_exec.h          |  82 +++-
 src/intel/compiler/brw_compiler.h          |  98 +++-
 src/intel/compiler/brw_eu.h                |  21 +-
 src/intel/compiler/brw_eu_defines.h        |   2 -
 src/intel/compiler/brw_eu_emit.c           |  39 +-
 src/intel/compiler/brw_fs.cpp              | 666 --
 src/intel/compiler/brw_fs.h                |  53 +-
 src/intel/compiler/brw_fs_builder.h        |   6 +-
 src/intel/compiler/brw_fs_cse.cpp          |   1 -
 src/intel/compiler/brw_fs_generator.cpp    | 318 ++--
 src/intel/compiler/brw_fs_nir.cpp          |  57 ++-
 src/intel/compiler/brw_fs_visitor.cpp      | 193
 src/intel/compiler/brw_ir_fs.h             |   1 +
 src/intel/compiler/brw_shader.cpp          |  12 +-
 src/intel/compiler/brw_vec4.cpp            |   2 +-
 src/intel/compiler/brw_vec4_gs_visitor.cpp |   2 +-
 src/intel/compiler/brw_vec4_tcs.cpp        |   2 +-
 src/intel/compiler/brw_wm_iz.cpp           |  11 +-
 src/intel/vulkan/anv_pipeline.c            |   2 +-
 src/intel/vulkan/genX_pipeline.c           |  40 +-