Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders
On Thu, May 31, 2018 at 10:39 AM Matt Turner wrote:
>
> On Fri, May 25, 2018 at 3:28 PM, Matt Turner wrote:
> >> 1-6, 8-20 are
> >>
> >> Reviewed-by: Matt Turner
> >
> > 7, 22-31 are too.
>
> 34-49 are too.

21 landed separately.

32, 33, 51-53 are also

Reviewed-by: Matt Turner

so I think you're just missing a R-b on 50.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
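As a sanity check on the tally above, the following C sketch marks every patch range acknowledged in this thread (the Reviewed-by ranges plus patch 21, which landed separately) and lists what is left of the 53 patches. The struct and function names are made up for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_PATCHES 53

/* Inclusive patch range, as quoted in the review replies. */
struct patch_range {
    int first;
    int last;
};

/* Ranges acknowledged in the thread: the Reviewed-by spans plus
 * patch 21, which landed separately. */
static const struct patch_range acked[] = {
    {1, 6}, {8, 20}, {7, 7}, {22, 31}, {34, 49},
    {21, 21}, {32, 33}, {51, 53},
};

/* Writes the patch numbers not covered by any range into `missing`
 * (capacity NUM_PATCHES) and returns how many were written. */
static size_t find_unreviewed(int *missing)
{
    bool covered[NUM_PATCHES + 1] = { false };
    size_t count = 0;

    assert(missing != NULL);

    for (size_t i = 0; i < sizeof(acked) / sizeof(acked[0]); i++) {
        for (int p = acked[i].first; p <= acked[i].last; p++)
            covered[p] = true;
    }

    for (int p = 1; p <= NUM_PATCHES; p++) {
        if (!covered[p])
            missing[count++] = p;
    }
    return count;
}
```

Calling find_unreviewed() with a 53-entry buffer returns a single entry, patch 50, which agrees with the conclusion in the mail.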
Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders
Hi,

On 30.05.2018 17:30, Jason Ekstrand wrote:
> On May 30, 2018 06:45:29 Eero Tamminen wrote:
>> On 29.05.2018 18:58, Eero Tamminen wrote:
>>> On 25.05.2018 00:55, Jason Ekstrand wrote:
>>>> This patch series adds back-end compiler support for SIMD32 fragment
>>>> shaders. Support is added and everything works but it's currently
>>>> hidden behind INTEL_DEBUG=do32. We know that it improves performance
>>>> in some cases but we do not yet have a good enough heuristic to start
>>>> turning it on by default. The objective of this series is just to get
>>>> the compiler infrastructure landed so that it stops bit-rotting in
>>>> Curro's branch.
>>>
>>> Tested v3 on BXT & SKL. Everything seems to work fine.
>>
>> Everything works fine also on GEN8 (BSW & BDW GT2), but half the tests
>> invoke GPU hangs on GEN7 (BYT & HSW GT2).
>
> That problem is known. It's caused by using SIMD32 shaders for fast
> clears. The SIMD32 replicated clear shaders were added on at the last
> minute and didn't get good enough testing before sending out the
> series. We can either drop those two patches and modify the last one to
> not do SIMD32 when use_replicated_clear is set or I have another patch
> which just disables SIMD32 for fast clears.

AFAIK plain copy & write shaders (like clear) don't benefit from SIMD32.
They are 100% bottlenecked by input/output bandwidth already with SIMD16,
so instruction scheduling latency improvement can't help. At worst SIMD32
can make them slower, if it causes extra cache thrashing.

- Eero
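The two exclusions raised in this thread (no SIMD32 for replicated-clear shaders, and no SIMD32 for MRT shaders) can be sketched in C roughly as below. This is only an illustration of the "simple heuristic" described in the discussion: struct fs_summary and want_simd32() are hypothetical names, not Mesa structures or functions, and the series itself keeps SIMD32 behind INTEL_DEBUG=do32 rather than applying any such rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a fragment shader; not a Mesa structure. */
struct fs_summary {
    unsigned num_color_outputs;   /* number of render targets written */
    bool     is_replicated_clear; /* fast-clear (repclear) shader */
};

/* Returns true if the simple heuristic would pick SIMD32 dispatch. */
static bool want_simd32(const struct fs_summary *fs)
{
    assert(fs != NULL);

    /* Fast clears are bandwidth bound already at SIMD16, and SIMD32
     * clear shaders were reported to hang GEN7 in this thread. */
    if (fs->is_replicated_clear)
        return false;

    /* MRT shaders are write bound and were reported to thrash
     * with SIMD32. */
    if (fs->num_color_outputs > 1)
        return false;

    return true;
}
```

A shader writing one render target passes the check, while a four-target MRT shader or a replicated clear does not. A real heuristic would also need the scheduler and texture-fetch feedback discussed in the thread, which this sketch does not attempt.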
Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders
On Fri, May 25, 2018 at 3:28 PM, Matt Turner wrote:
>> 1-6, 8-20 are
>>
>> Reviewed-by: Matt Turner
>
> 7, 22-31 are too.

34-49 are too.
Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders
On May 30, 2018 06:45:29 Eero Tamminen wrote:
> Hi,
>
> On 29.05.2018 18:58, Eero Tamminen wrote:
>> On 25.05.2018 00:55, Jason Ekstrand wrote:
>>> This patch series adds back-end compiler support for SIMD32 fragment
>>> shaders. Support is added and everything works but it's currently
>>> hidden behind INTEL_DEBUG=do32. We know that it improves performance
>>> in some cases but we do not yet have a good enough heuristic to start
>>> turning it on by default. The objective of this series is just to get
>>> the compiler infrastructure landed so that it stops bit-rotting in
>>> Curro's branch.
>>
>> Tested v3 on BXT & SKL. Everything seems to work fine.
>
> Everything works fine also on GEN8 (BSW & BDW GT2), but half the tests
> invoke GPU hangs on GEN7 (BYT & HSW GT2).

That problem is known. It's caused by using SIMD32 shaders for fast
clears. The SIMD32 replicated clear shaders were added on at the last
minute and didn't get good enough testing before sending out the series.
We can either drop those two patches and modify the last one to not do
SIMD32 when use_replicated_clear is set or I have another patch which
just disables SIMD32 for fast clears.
Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders
Hi,

On 29.05.2018 18:58, Eero Tamminen wrote:
> On 25.05.2018 00:55, Jason Ekstrand wrote:
>> This patch series adds back-end compiler support for SIMD32 fragment
>> shaders. Support is added and everything works but it's currently
>> hidden behind INTEL_DEBUG=do32. We know that it improves performance
>> in some cases but we do not yet have a good enough heuristic to start
>> turning it on by default. The objective of this series is just to get
>> the compiler infrastructure landed so that it stops bit-rotting in
>> Curro's branch.
>
> Tested v3 on BXT & SKL. Everything seems to work fine.

Everything works fine also on GEN8 (BSW & BDW GT2), but half the tests
invoke GPU hangs on GEN7 (BYT & HSW GT2).

One option would be to support SIMD32 just for GEN8+.

> Tested-by Eero Tamminen
>
>> Figuring out a good heuristic is left as an exercise to the reader. :-)
>
> A simple heuristic that just enables SIMD32 for everything that isn't
> an MRT shader gives nice perf improvements on BXT J4205:
> * +30% GfxBench ALU2
> * +25% SynMark PSPom
> * +10% GpuTest Julia32
> * +9% GfxBench CarChase
> * +7% GfxBench Manhattan 3.0
> * +3-7% GLB T-Rex, SynMark ShMapVsm, GpuTest Triangle
> * +1-3% GfxBench Manhattan 3.1 & T-Rex, Unigine Heaven, GpuTest FurMark
> * -1-2% GfxBench Aztec Ruins, MemBW Write, SynMark DeferredAA, Fill*,
>   VSInstancing & ZBuffer
> * -2-3% GLB 2.7 Fill
> * -4-5% MemBW Blend
>
> On SKL, perf differences are smaller.

On GEN8, the improvements are smaller and regressions larger with the
same heuristic.

The main difference on the 12 EU single-channel BSW is a -15% perf
regression in SynMark FillTexMulti, i.e. sampling 8 textures and writing
out their average value. With single-channel memory, increased memory
latency causes a lot more thrashing with SIMD32 when many textures are
being sampled close together.

> SIMD32 can cause write bound tests to thrash, which is visible as a perf
> regression in the fully write bound tests above (that's also the reason
> why SIMD32 is good to disable with MRT shaders).
>
> As to reads, SIMD32 improves cache locality until it starts thrashing.
> In the above GfxBench tests, with the amount of texture sampling they
> do, this shows in HW counters as increased texture cache misses
> (thrashing), but fewer L3 misses (better locality). Along with (more
> important) better latency compensation, these explain why SIMD32
> improves performance in them.
>
> More advanced heuristics that try to avoid the SIMD32 performance
> regressions unfortunately also get rid of a clear part of the above
> improvements. Such heuristics would need an improved instruction
> scheduler that provides feedback on which shaders have latency issues
> where SIMD32 would help.

Heuristics for things affecting texture fetch latencies would help, like
how many fetches there are, to how many different textures and how close
together they are vs. how large the caches are and how fast the RAM is.

- Eero

> (A potential run-time heuristic would be disabling SIMD32 when too
> large textures are bound for draw.)
>
> - Eero
Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders
Hi,

On 29.05.2018 18:58, Eero Tamminen wrote:
> On 25.05.2018 00:55, Jason Ekstrand wrote:
>> This patch series adds back-end compiler support for SIMD32 fragment
>> shaders. Support is added and everything works but it's currently
>> hidden behind INTEL_DEBUG=do32. We know that it improves performance
>> in some cases but we do not yet have a good enough heuristic to start
>> turning it on by default. The objective of this series is to just to
>> get the compiler infrastructure landed so that it stops bit-rotting in
>> Curro's branch.
>
> Tested v3 on BXT & SKL. Everything seems to work otherwise fine.

s/otherwise//

- Eero

(regardless of how many times one reads a mail before sending,
there always seems to be some leftover one misses.)
Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders
Hi,

On 25.05.2018 00:55, Jason Ekstrand wrote:
> This patch series adds back-end compiler support for SIMD32 fragment
> shaders. Support is added and everything works but it's currently
> hidden behind INTEL_DEBUG=do32. We know that it improves performance
> in some cases but we do not yet have a good enough heuristic to start
> turning it on by default. The objective of this series is just to get
> the compiler infrastructure landed so that it stops bit-rotting in
> Curro's branch.

Tested v3 on BXT & SKL. Everything seems to work otherwise fine.

Tested-by Eero Tamminen

> Figuring out a good heuristic is left as an exercise to the reader. :-)

A simple heuristic that just enables SIMD32 for everything that isn't
an MRT shader gives nice perf improvements on BXT J4205:
* +30% GfxBench ALU2
* +25% SynMark PSPom
* +10% GpuTest Julia32
* +9% GfxBench CarChase
* +7% GfxBench Manhattan 3.0
* +3-7% GLB T-Rex, SynMark ShMapVsm, GpuTest Triangle
* +1-3% GfxBench Manhattan 3.1 & T-Rex, Unigine Heaven, GpuTest FurMark
* -1-2% GfxBench Aztec Ruins, MemBW Write, SynMark DeferredAA, Fill*,
  VSInstancing & ZBuffer
* -2-3% GLB 2.7 Fill
* -4-5% MemBW Blend

On SKL, perf differences are smaller.

SIMD32 can cause write bound tests to thrash, which is visible as a perf
regression in the fully write bound tests above (that's also the reason
why SIMD32 is good to disable with MRT shaders).

As to reads, SIMD32 improves cache locality until it starts thrashing.
In the above GfxBench tests, with the amount of texture sampling they do,
this shows in HW counters as increased texture cache misses (thrashing),
but fewer L3 misses (better locality). Along with (more important) better
latency compensation, these explain why SIMD32 improves performance in
them.

More advanced heuristics that try to avoid the SIMD32 performance
regressions unfortunately also get rid of a clear part of the above
improvements. Such heuristics would need an improved instruction
scheduler that provides feedback on which shaders have latency issues
where SIMD32 would help.

(A potential run-time heuristic would be disabling SIMD32 when too large
textures are bound for draw.)

- Eero
Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders
On Fri, May 25, 2018 at 11:50 AM, Matt Turner wrote:
> On Thu, May 24, 2018 at 2:55 PM, Jason Ekstrand wrote:
>> This patch series adds back-end compiler support for SIMD32 fragment
>> shaders. Support is added and everything works but it's currently hidden
>> behind INTEL_DEBUG=do32. We know that it improves performance in some
>> cases but we do not yet have a good enough heuristic to start turning it on
>> by default. The objective of this series is just to get the compiler
>> infrastructure landed so that it stops bit-rotting in Curro's branch.
>> Figuring out a good heuristic is left as an exercise to the reader. :-)
>
> 1-6, 8-20 are
>
> Reviewed-by: Matt Turner

7, 22-31 are too.
Re: [Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders
On Thu, May 24, 2018 at 2:55 PM, Jason Ekstrand wrote:
> This patch series adds back-end compiler support for SIMD32 fragment
> shaders. Support is added and everything works but it's currently hidden
> behind INTEL_DEBUG=do32. We know that it improves performance in some
> cases but we do not yet have a good enough heuristic to start turning it on
> by default. The objective of this series is just to get the compiler
> infrastructure landed so that it stops bit-rotting in Curro's branch.
> Figuring out a good heuristic is left as an exercise to the reader. :-)

1-6, 8-20 are

Reviewed-by: Matt Turner
[Mesa-dev] [PATCH 00/53] intel/fs: SIMD32 support for fragment shaders
This patch series adds back-end compiler support for SIMD32 fragment
shaders. Support is added and everything works but it's currently hidden
behind INTEL_DEBUG=do32. We know that it improves performance in some
cases but we do not yet have a good enough heuristic to start turning it on
by default. The objective of this series is just to get the compiler
infrastructure landed so that it stops bit-rotting in Curro's branch.
Figuring out a good heuristic is left as an exercise to the reader. :-)

Francisco Jerez (34):
  intel/eu: Remove brw_codegen::compressed_stack.
  intel/fs: Rename a local variable so it doesn't shadow component()
  intel/fs: Use the ATTR file for FS inputs
  intel/fs: Replace the CINTERP opcode with a simple MOV
  intel/fs: Add explicit last_rt flag to fb writes orthogonal to eot.
  intel/fs: Fix Gen4-5 FB write AA data payload munging for non-EOT writes.
  intel/eu: Return new instruction to caller from brw_fb_WRITE().
  intel/fs: Fix fs_inst::flags_written() for Gen4-5 FB writes.
  intel/fs: Fix implied_mrf_writes() for headerless FB writes.
  intel/fs: Remove program key argument from generator.
  intel/fs: Disable SIMD32 dispatch on Gen4-6 with control flow
  intel/fs: Disable SIMD32 dispatch for fragment shaders with discard.
  intel/eu: Fix pixel interpolator queries for SIMD32.
  intel/fs: Fix codegen of FS_OPCODE_SET_SAMPLE_ID for SIMD32.
  intel/fs: Don't enable dual source blend if no outputs are written
  intel/fs: Fix FB write message control codegen for SIMD32.
  intel/fs: Fix logical FB write lowering for SIMD32
  intel/fs: Fix FB read header setup for SIMD32.
  intel/fs: Rework INTERPOLATE_AT_PER_SLOT_OFFSET
  intel/fs: Mark LINTERP opcode as writing accumulator implicitly on pre-Gen7.
  intel/fs: Disable opt_sampler_eot() in 32-wide dispatch.
  i965: Add plumbing for shader time in 32-wide FS dispatch mode.
  intel/fs: Simplify fs_visitor::emit_samplepos_setup
  intel/fs: Use fs_regs instead of brw_regs in the unlit centroid workaround
  intel/fs: Wrap FS payload register look-up in a helper function.
  intel/fs: Extend thread payload layout to SIMD32
  intel/fs: Implement 32-wide FS payload setup on Gen6+
  intel/fs: Fix Gen7 compressed source region alignment restriction for SIMD32
  intel/fs: Fix sample id setup for SIMD32.
  intel/fs: Generalize the unlit centroid workaround
  intel/fs: Fix Gen6+ interpolation setup for SIMD32
  intel/fs: Fix fs_builder::sample_mask_reg() for 32-wide FS dispatch.
  intel/fs: Fix nir_intrinsic_load_helper_invocation for SIMD32.
  intel/fs: Build 32-wide FS shaders.

Jason Ekstrand (19):
  intel/fs: Assert that the gen4-6 plane restrictions are followed
  intel/fs: Use groups for SIMD16 LINTERP on gen11+
  intel/fs: FS_OPCODE_REP_FB_WRITE has side effects
  intel/fs: Properly track implied header regs read by FB writes
  intel/fs: Pull FB write implied headers from src[0]
  intel/fs: Set up FB write message headers in the visitor
  i965: Re-arrange shader kernel setup in WM state
  intel/compiler: Add and use helpers for working with KSP indices
  intel/fs: Rework KSP data to be SIMD width-based
  intel/fs: Split instructions low to high in lower_simd_width
  intel/fs: Properly copy default flag reg for 3src instrucitons
  intel/fs: Add the group to the flag subreg number on SNB and older
  intel/fs: Emit LINE+MAC for LINTERP with unaligned coordinates
  intel/fs: Emit MOV_DISPATCH_TO_FLAGS once for the centroid workaround
  intel/fs: Get rid of MOV_DISPATCH_TO_FLAGS
  intel/fs: Add fields to wm_prog_data for SIMD32 dispatch
  intel/anv,blorp,i965: Implement the SKL 16x MSAA SIMD32 workaround
  intel/fs: Remove support push constants in repclear shaders
  intel/fs: Support SIMD32 repclear shaders

 src/intel/blorp/blorp.c                    |   2 +-
 src/intel/blorp/blorp_genX_exec.h          |  82 +++-
 src/intel/compiler/brw_compiler.h          |  98 +++-
 src/intel/compiler/brw_eu.h                |  21 +-
 src/intel/compiler/brw_eu_defines.h        |   2 -
 src/intel/compiler/brw_eu_emit.c           |  39 +-
 src/intel/compiler/brw_fs.cpp              | 666 --
 src/intel/compiler/brw_fs.h                |  53 +-
 src/intel/compiler/brw_fs_builder.h        |   6 +-
 src/intel/compiler/brw_fs_cse.cpp          |   1 -
 src/intel/compiler/brw_fs_generator.cpp    | 318 ++--
 src/intel/compiler/brw_fs_nir.cpp          |  57 ++-
 src/intel/compiler/brw_fs_visitor.cpp      | 193
 src/intel/compiler/brw_ir_fs.h             |   1 +
 src/intel/compiler/brw_shader.cpp          |  12 +-
 src/intel/compiler/brw_vec4.cpp            |   2 +-
 src/intel/compiler/brw_vec4_gs_visitor.cpp |   2 +-
 src/intel/compiler/brw_vec4_tcs.cpp        |   2 +-
 src/intel/compiler/brw_wm_iz.cpp           |  11 +-
 src/intel/vulkan/anv_pipeline.c            |   2 +-
 src/intel/vulkan/genX_pipeline.c           |  40 +-
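
The cover letter says SIMD32 is hidden behind INTEL_DEBUG=do32. Below is a minimal C sketch of how such a gate can be tested, assuming the variable holds a comma-separated list of flag names. debug_list_has() and simd32_debug_enabled() are hypothetical helpers written for this note, not Mesa functions, and Mesa's real parsing may well differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Returns true if `flag` appears as a whole token in the
 * comma-separated list `value`. A NULL list has no flags set. */
static bool debug_list_has(const char *value, const char *flag)
{
    assert(flag != NULL);
    if (value == NULL)
        return false;

    const size_t flag_len = strlen(flag);
    const char *p = value;

    while (*p != '\0') {
        /* Length of the current token, up to the next comma or end. */
        size_t tok_len = strcspn(p, ",");
        if (tok_len == flag_len && strncmp(p, flag, flag_len) == 0)
            return true;
        p += tok_len;
        if (*p == ',')
            p++;
    }
    return false;
}

/* Reads the environment; the variable name and the flag name are
 * the ones quoted in the cover letter. */
static bool simd32_debug_enabled(void)
{
    return debug_list_has(getenv("INTEL_DEBUG"), "do32");
}
```

With this sketch, a value such as "perf,do32" enables the gate, while "do32x" or an unset variable does not.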