Re: [Mesa-dev] [PATCH v3 00/13] TGSI: improved live range tracking, also including arrays

2018-04-30 Thread Roland Scheidegger
FWIW I'm not really qualified to review this, but this alleviates the
concerns I had some time ago about doing spilling for r600 before sb.
So this makes all sense to me.

Roland

Am 28.04.2018 um 21:30 schrieb Gert Wollny:
> this is another update of the series I've sent before. 
> 
> v3: 
>  - Add new test mesa/st/tests/meson.build
>  - rebase patches to latest HEAD
> 
> this is the merged version of two series [1] (TGSI: split, merge 
> and interleave arrays) and [2] (mesa/st/glsl_to_tgsi: Properly 
> resolve life times for simple if/else + use constructs) I sent
> earlier. Considering that both parts target the same optimization 
> step and fix a bug if both are applied, I thought it is better to 
> add this second patch to the series. Changes refer to v1 of [1]: 
> 
> v2: 
>  - rebase patches to latest HEAD
>  - add some code that allows obtaining some statistics about register
>and instruction usage
>  - Add patch [2] that improves resolving the live range estimation with 
>simple if/else and use constructs. By adding this patch the series 
>fixes https://bugs.freedesktop.org/show_bug.cgi?id=105371
> 
>  v1: 
> Patch 1: Split arrays that are only accessed directly:  
> I posted a first version off the the array splitting in patch 1 some 
> time ago. Eric Anholt pointed out that this might be done in 
> opt_array_splitting.cpp, but in another comment Timothy pointed out 
> that this is far from trivial, and he also pointed out that he was 
> proposing similar patches for NIR, but since currently no NIR->TGSI 
> transformation is available, TGSI based drivers can't make use of this. 
> 
> While the reminder off the series could be applied without this patch, I 
> think it makes less sense to do all the optimizations on arrays that could 
> simply be split into individual registers, so I repost the patch with some 
> changes. 
> 
> I tried to be exhaustive with comments and make the variable any type names 
> self-explaining, but since I've been staring at this code for a long time I
> don't think I am capable of seeing any problems any more, so comments are 
> very 
> welcome.
> 
> Best, 
> Gert 
> 
> PS: I have no git write access. 
> 
> 
> [1] https://patchwork.freedesktop.org/series/37991/ 
> [2] https://patchwork.freedesktop.org/series/39471/
> 
> 
> Gert Wollny (13):
>   mesa/st/glsl_to_tgsi: Add method to collect some statistics
>   mesa/st: glsl_to_tgsi: Split arrays who's elements are only accessed
> directly
>   mesa/st/glsl_to_tgsi: Properly resolve life times simple if/else + use
> constructs
>   mesa/st/glsl_to_tgsi:rename lifetime to register_live_range
>   mesa/st: Add helper classes for array merging and interleaving
>   mesa/st/glsl_to_tgsi: Add class to track array live range
>   mesa/st/glsl_to_tgsi:Add array merge logic
>   mesa/st/tests: Add tests for array merge helper classes.
>   mesa/st/glsl_to_tgsi: rename access_record to register_merge_record
> and some more renames
>   mesa/st/glsl_to_tgsi: move evaluation of read mask up in the call
> hierarchy
>   mesa/st/glsl_to_tgsi: add class for array access tracking
>   mesa/st/glsl_to_tgsi: add array life range evaluation into tracking
> code
>   mesa/st/glsl_to_tgsi: Expose array live range tracking  and merging
> 
>  src/mesa/Makefile.sources  |   2 +
>  src/mesa/meson.build   |   2 +
>  src/mesa/state_tracker/st_glsl_to_tgsi.cpp | 205 +-
>  .../state_tracker/st_glsl_to_tgsi_array_merge.cpp  | 739 
> +
>  .../state_tracker/st_glsl_to_tgsi_array_merge.h| 186 ++
>  .../state_tracker/st_glsl_to_tgsi_temprename.cpp   | 304 ++---
>  .../state_tracker/st_glsl_to_tgsi_temprename.h |  45 +-
>  src/mesa/state_tracker/tests/Makefile.am   |  20 +-
>  src/mesa/state_tracker/tests/meson.build   |  14 +
>  src/mesa/state_tracker/tests/st_tests_common.cpp   |  25 +-
>  src/mesa/state_tracker/tests/st_tests_common.h |  20 +-
>  .../tests/test_glsl_to_tgsi_array_merge.cpp| 296 +
>  .../tests/test_glsl_to_tgsi_lifetime.cpp   |  33 +-
>  13 files changed, 1760 insertions(+), 131 deletions(-)
>  create mode 100644 src/mesa/state_tracker/st_glsl_to_tgsi_array_merge.cpp
>  create mode 100644 src/mesa/state_tracker/st_glsl_to_tgsi_array_merge.h
>  create mode 100644 
> src/mesa/state_tracker/tests/test_glsl_to_tgsi_array_merge.cpp
> 

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH v3 00/13] TGSI: improved live range tracking, also including arrays

2018-04-30 Thread Gert Wollny

[...]
> 
> $R600_DEBUG=merge ST_DEBUG=tgsi ./run wollny/
> 
> ATTENTION: default value of option
> allow_glsl_extension_directive_midshader overridden by environment.
> run: state_tracker/st_glsl_to_tgsi.cpp:5783: ureg_dst
> dst_register(st_translate*, gl_register_file, unsigned int, unsigned
> int): Assertion `array_id && array_id <= t->num_temp_arrays' failed.

This would indicate that the array is split, but still addressed
somewhere, i.e. I don't look at all possible locations where arrays are
used. I think I debug this by letting the R600 driver advertise
bindless texture support (obvioulsy it will not pass the tests, because
it is not implemented and  the hardware may not even support it, but
the GLSL->TGSI conversion should still work. 


> 
> 
> I can, if you dont mind waiting for an answer sometimes.
> 
I will rely on this, but I also will take some time do work on it. 


> But maybe even easier: is there an implicit/explicit magic number I
> can play with to see if it changes anything?

Unfortunately, in my code there are no specific magic numbers to
change, if there is anything then it is probably in the LLVM backend
(that is not used with r600). 

The only think you could try is to disable one of the two parts in my
code: one is the split_array(), called around 

  st_glsl_to_tgsi:7018 

and the other is where the array merging and interleaving are applied,
i.e. line 

  st_glsl_to_tgsi:5537

My take is that this latter step is responsible for the regressions, 
especially the array interleaving could make it more difficult to
optimize, because if the optimizer pass doesn't look at a per-component 
access, but on the higher level per-register base, then register live-
ranges for the array elements will become longer, making it more
difficult to schedule everything without spilling.


> ATM it seems like your code improves half the shaders its affecting a
> lot and hurting the other half bad like it hits an invisible wall.
> I think one problem could be the relationship between VGPRs and SGPRs
> used and max Wavefronts achieved.
Unfortunately, on r600 such fine grained information is not available,
so that I also have no real clue. 

Cheer,
Gert

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH v3 00/13] TGSI: improved live range tracking, also including arrays

2018-04-29 Thread Benedikt Schemmer


Am 29.04.2018 um 21:44 schrieb Gert Wollny:
> Hello Benedict,
> 
> thanks for all the testing!

thanks for all the developing ;)

> 
> On 29.04.2018 12:12, Benedikt Schemmer wrote:
>>> Which are the names of these test? I'd like to check this on r600,
>>> because here I didn't see any regressions last time I checked. 
>>>
>> might of course be different on r600 (is bindless available?),
>> also shader-db is more sensitive to problems than piglit
>>
>> 1. tests/spec/arb_bindless_texture/compiler/images/arrays-of-struct.frag
>> 2. tests/spec/arb_bindless_texture/compiler/samplers/arrays-of-struct.frag
> Indeed, bindless testures are not available on r600, so it is quite
> difficult for me to test this. I would guess that parameters related to
> this might be stored in the TGSI declaration that I currently don't check.
> 
> If you have time for it, could you send me a TGSI dump of one of these
> shaders?
> With "ST_DEBUG=tgsi" this should be possible.

I created 'R600_DEBUG=merge' so I can switch without having to recompile.


$ST_DEBUG=tgsi ./run wollny/

ATTENTION: default value of option allow_glsl_extension_directive_midshader 
overridden by environment.
FRAG
DCL TEMP[0..55], ARRAY(1), LOCAL
IMM[0] INT32 {0, 0, 0, 0}
IMM[1] FLT32 {1., 2., 3., 4.}
  0: STORE TEMP[0], IMM[0]., IMM[1], 2D
  1: END

wollny/41d88325fd2a57cb6af40de02dc281ee0683cc40_2.shader_test - LLVM diagnostic 
(remark): :0:0: 12 instructions in function
wollny/41d88325fd2a57cb6af40de02dc281ee0683cc40_2.shader_test - Shader Stats: 
SGPRS: 16 VGPRS: 16 Code Size: 80 LDS: 0 Scratch: 0 Max Waves: 8 Spilled SGPRs: 
0 Spilled VGPRs: 0 PrivMem VGPRs: 0
FRAG
DCL OUT[0], COLOR
DCL TEMP[0..1], LOCAL
DCL TEMP[2..57], ARRAY(1), LOCAL
IMM[0] FLT32 {0., 0., 0., 0.}
  0: MOV TEMP[0].xy, IMM[0].
  1: TEX TEMP[1], TEMP[0], TEMP[2].xyxy, 2D
  2: MOV OUT[0], TEMP[1]
  3: END

wollny/a115868a349cd666b842a0e70f47451b4463903a_2.shader_test - LLVM diagnostic 
(remark): :0:0: 11 instructions in function
wollny/a115868a349cd666b842a0e70f47451b4463903a_2.shader_test - Shader Stats: 
SGPRS: 24 VGPRS: 16 Code Size: 72 LDS: 0 Scratch: 0 Max Waves: 8 Spilled SGPRs: 
0 Spilled VGPRs: 0 PrivMem VGPRs: 0
Thread 0 took 0.13 seconds and compiled 2 shaders (not including SIMD16) with 1 
GL context switches



$R600_DEBUG=merge ST_DEBUG=tgsi ./run wollny/

ATTENTION: default value of option allow_glsl_extension_directive_midshader 
overridden by environment.
run: state_tracker/st_glsl_to_tgsi.cpp:5783: ureg_dst 
dst_register(st_translate*, gl_register_file, unsigned int, unsigned int): 
Assertion `array_id && array_id <= t->num_temp_arrays' failed.

 => CRASHED <= while processing these shaders:

wollny/41d88325fd2a57cb6af40de02dc281ee0683cc40_2.shader_test

>>
>>> For radeonsi my guess would be that the llvm optimizer works better
>>> when the registers are not yet merged, and that would be the reason why
>>> register_merge is disabled. 
>> well at least sometimes it doesn't, low hanging fruit maybe?
> Unfortunately, I can't test on radeonsi

I can, if you dont mind waiting for an answer sometimes.

But maybe even easier: is there an implicit/explicit magic number I can play 
with to see if it changes anything?

ATM it seems like your code improves half the shaders its affecting a lot and 
hurting the other half bad like it hits an invisible wall.
I think one problem could be the relationship between VGPRs and SGPRs used and 
max Wavefronts achieved.

This is somewhat similar to NIR although that changes things all over the place.

> 
> Best,
> Gert
> 
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH v3 00/13] TGSI: improved live range tracking, also including arrays

2018-04-29 Thread Gert Wollny
Hello Benedict,

thanks for all the testing!

On 29.04.2018 12:12, Benedikt Schemmer wrote:
>> Which are the names of these test? I'd like to check this on r600,
>> because here I didn't see any regressions last time I checked. 
>>
> might of course be different on r600 (is bindless available?),
> also shader-db is more sensitive to problems than piglit
>
> 1. tests/spec/arb_bindless_texture/compiler/images/arrays-of-struct.frag
> 2. tests/spec/arb_bindless_texture/compiler/samplers/arrays-of-struct.frag
Indeed, bindless testures are not available on r600, so it is quite
difficult for me to test this. I would guess that parameters related to
this might be stored in the TGSI declaration that I currently don't check.

If you have time for it, could you send me a TGSI dump of one of these
shaders?
With "ST_DEBUG=tgsi" this should be possible.
>
>> For radeonsi my guess would be that the llvm optimizer works better
>> when the registers are not yet merged, and that would be the reason why
>> register_merge is disabled. 
> well at least sometimes it doesn't, low hanging fruit maybe?
Unfortunately, I can't test on radeonsi

Best,
Gert
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH v3 00/13] TGSI: improved live range tracking, also including arrays

2018-04-29 Thread Benedikt Schemmer
sorry the last one wasnt correct:

register merge vs yours

Max Increase:

SGPRS: 80 -> 96 (20.00 %) (in 
shaders/deusex_mankind/7c39f71090a9db19ac2e1542ea12804ae6c6495b_4864.shader_test)
VGPRS: 64 -> 84 (31.25 %) (in 
shaders/dirtrally/0859b69789591d7046e211400b1edd9a7cfca734_742.shader_test)
Spilled SGPRs: 0 -> 16 (0.00 %) (in 
shaders/deusex_mankind/d64e2084204e29749639e8fbd9a1e507c7e5e1dd_6840.shader_test)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 24 -> 32 (33.33 %) (in 
shaders/deusex_mankind/28cac87049d8c833e72296a5a02ea6118f1144e5_5876.shader_test)
Scratch size: 28 -> 36 (28.57 %) dwords per thread (in 
shaders/deusex_mankind/28cac87049d8c833e72296a5a02ea6118f1144e5_5876.shader_test)
Code Size: 6988 -> 8036 (15.00 %) bytes (in shaders/cat/1847.shader_test)
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 5 -> 7 (40.00 %) (in 
shaders/ruiner/0967c5fce7fc456496b1cfa25fbb1d1c4dcf9bed_2958.shader_test)
Wait states: 0 -> 0 (0.00 %)

Max Decrease:

SGPRS: 96 -> 72 (-25.00 %) (in 
shaders/deusex_mankind/c1f098c7b14b1ba291cfa9bba4e41ba91acaba30_3630.shader_test)
VGPRS: 80 -> 68 (-15.00 %) (in 
shaders/dirtrally/710d3319bc986ea003f1a84ec6d3c01b2a8b9ded_2482.shader_test)
Spilled SGPRs: 19 -> 0 (-100.00 %) (in 
shaders/deusex_mankind/0749c9ae23417f918c7286fe502ff5de4cb8e1a0_3276.shader_test)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 9080 -> 8712 (-4.05 %) bytes (in 
shaders/dirtrally/2e1a8e04b49dee3e39b6f30a7c707f395abf_2480.shader_test)
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 8 -> 5 (-37.50 %) (in 
shaders/deusex_mankind/8dabec49e5b6c3b1cbcbaee194eff69f164d72f4_3968.shader_test)
Wait states: 0 -> 0 (0.00 %)


 PERCENTAGE DELTASShaders SGPRs VGPRs SpillSGPR SpillVGPR  PrivVGPR 
  Scratch  CodeSize  MaxWaves Waits
 0ad6 . . . . . 
. . . .
 aer  590 . . . . . 
. . . .
 alien_isolation 1414 . . . . . 
. . . .
 anholt10 . . . . . 
. . . .
 bioshock_infinite   2581 .0.04 % . . . 
.0.01 % . .
 blackmesa584 . . . . . 
. . . .
 cat  573   -0.06 %   -0.09 % . . . 
.0.08 %0.05 % .
 csgo1392 . . . . . 
. . . .
 deadisland_definitive   1776 . . . . . 
. . . .
 deadisland_original11602 . . . . . 
. . . .
 deadisland_riptide_..293 . . . . . 
. . . .
 deusex_mankind  5051   -0.06 % .   -4.64 % .   33.33 % 
  28.57 % .   -0.01 % .
 dirtrally7870.04 %0.49 %   -0.46 % . . 
.0.03 %   -0.23 % .
 dolphin   22 . . . . . 
. . . .
 dyinglight  4012 . . . . . 
. . . .
 eurotruck2   216 . . . . . 
. . . .
 f1_2015  746 .   -0.03 %1.77 % . . 
.0.05 % . .
 glamor16 . . . . . 
. . . .
 hl2ep1   294 . . . . . 
. . . .
 hl2ep2   154 . . . . . 
. . . .
 hl2lostcoast  66 . . . . . 
. . . .
 hlsl3582 . . . . . 
. . . .
 humus-celshading   4 . . . . . 
. . . .
 humus-domino   6 . . . . . 
. . . .
 humus-dynamicbranching24 . . . . . 
. . . .
 humus-hdr 10 . . . . . 
. . . .
 humus-portals  2 . . . . . 
. . .  

Re: [Mesa-dev] [PATCH v3 00/13] TGSI: improved live range tracking, also including arrays

2018-04-29 Thread Benedikt Schemmer
Real world

(old vs new)
nothing vs register merge:

PERCENTAGE DELTASShaders SGPRs VGPRs SpillSGPR SpillVGPR  PrivVGPR  
 Scratch  CodeSize  MaxWaves Waits
 0ad6 . . . . . 
. . . .
 aer  590 .0.26 %  -20.00 % . . 
.0.34 % . .
 alien_isolation 1414 . . . . . 
. . . .
 anholt10 . . . . . 
. . . .
 bioshock_infinite   2581   -0.02 % . . . . 
.0.12 % . .
 blackmesa584 . . . . . 
. . . .
 cat  573 .   -0.02 % . . . 
.0.11 % . .
 csgo1392 . .   -0.88 % . . 
.   -0.03 % . .
 deadisland_definitive   17760.06 % . . . . 
.0.15 %0.01 % .
 deadisland_original11602 . . . . . 
.0.05 % . .
 deadisland_riptide_..293   -0.06 %0.06 % . . . 
.0.32 % . .
 deusex_mankind  50510.14 %   -0.01 %   -1.57 % . . 
.0.18 % . .
 dirtrally787   -0.04 %0.15 %1.08 % . . 
.0.27 %   -0.07 % .
 dolphin   22 . . . . . 
. . . .
 dyinglight  4012 .0.05 % . . . 
.0.34 %   -0.01 % .
 eurotruck2   216 . . . . . 
. . . .
 f1_2015  746   -0.04 %0.02 %0.93 % . . 
.0.08 % . .
 glamor16   -2.33 % . . . . 
.3.97 % . .
 hl2ep1   294 . . . . . 
. . . .
 hl2ep2   154 . . . . . 
. . . .
 hl2lostcoast  66 . . . . . 
. . . .
 hlsl3582 . . . . . 
.   -0.14 % . .
 humus-celshading   4 . . . . . 
. . . .
 humus-domino   6 . . . . . 
. . . .
 humus-dynamicbranching24 . . . . . 
. . . .
 humus-hdr 10 . . . . . 
. . . .
 humus-portals  2 . . . . . 
. . . .
 humus-volumetricfog..  6 . . . . . 
. . . .
 kerbal  1016 .0.11 % . . . 
.0.31 % . .
 larago   664 . . . . . 
.0.01 % . .
 madmax   354 .   -0.08 % . . . 
.   -0.02 %0.04 % .
 metro2033redux  4410 .0.03 % . . . 
.0.06 %   -0.01 % .
 nexuiz80 . . . . . 
. . . .
 ruiner   685 .   -0.02 % . . . 
.0.04 % . .
 sauerbraten7 . . . . . 
. . . .
 serioussam2017   736 . .7.59 % . . 
.0.05 % . .
 soma 436 . . . . . 
. . . .
 specops 1814 . . . . . 
.0.35 % . .
 stellaris434 . . . . . 
.0.11 % . .
 supertuxkart   4 . . . . . 
. . . .
 talos762   -0.02 % .0.09 % . . 
.0.01 % . .
 tesseract

Re: [Mesa-dev] [PATCH v3 00/13] TGSI: improved live range tracking, also including arrays

2018-04-29 Thread Benedikt Schemmer


Am 29.04.2018 um 11:34 schrieb Gert Wollny:
> Am Sonntag, den 29.04.2018, 10:43 +0200 schrieb Benedikt Schemmer:
>> Hi Gert,
>>
>> couldn't resist at least to try what would happen if I enable
>> register merge for radeonsi:
>>
>>  PERCENTAGE DELTASShaders SGPRs VGPRs SpillSGPR
>> SpillVGPR  PrivVGPR   Scratch  CodeSize  MaxWaves Waits
>>  piglit 80732   -0.16 %   -0.02
>> % . .0.87 %0.86 %0.04 % . .
>>  --
>> 
>>  All affected 513  -17.58 %   -2.30
>> % . .4.12 %5.87 %1.73 %0.10 % .
>>  --
>> 
>>  Total  80732   -0.16 %   -0.02
>> % . .0.87 %0.86 %0.04 % . .
>>
>> I had already removed the defines around the debug code so thats also
>> happily outputting data.
>>
>> fails with two piglit shaders:
> Which are the names of these test? I'd like to check this on r600,
> because here I didn't see any regressions last time I checked. 
> 
might of course be different on r600 (is bindless available?),
also shader-db is more sensitive to problems than piglit

1. tests/spec/arb_bindless_texture/compiler/images/arrays-of-struct.frag
2. tests/spec/arb_bindless_texture/compiler/samplers/arrays-of-struct.frag

> 
>> Real world is a little different:
> 
> I guess these tests refer to enabled register_merge - without and with
> this patch set, no? 
> 
> Out of curiosity, did you also look at how enabling register_merge
> (before this series) impacts the result as compared to the normal
> operation of radeonsi? 

no I didn't
but if I do

(old vs new)
nothing vs register merge:

 PERCENTAGE DELTASShaders SGPRs VGPRs SpillSGPR SpillVGPR  PrivVGPR 
  Scratch  CodeSize  MaxWaves Waits
 piglit 80732 . . . . . 
.0.06 % . .
 
--
 All affected 4350.89 %0.36 % . . . 
.3.61 %   -0.06 % .
 
--
 Total  80732 . . . . . 
.0.06 % . .

register merge vs yours:

 PERCENTAGE DELTASShaders SGPRs VGPRs SpillSGPR SpillVGPR  PrivVGPR 
  Scratch  CodeSize  MaxWaves Waits
 piglit 80732   -0.16 %   -0.02 % . .0.87 % 
   0.86 %   -0.02 % . .
 
--
 All affected  83  -56.92 %  -14.22 % . .   11.67 % 
  16.67 %   -2.86 %0.93 % .
 
--
 Total  80732   -0.16 %   -0.02 % . .0.87 % 
   0.86 %   -0.02 % . .

nothing vs yours:

 PERCENTAGE DELTASShaders SGPRs VGPRs SpillSGPR SpillVGPR  PrivVGPR 
  Scratch  CodeSize  MaxWaves Waits
 piglit 80732   -0.16 %   -0.02 % . .0.87 % 
   0.86 %0.04 % . .
 
--
 All affected 513  -17.58 %   -2.30 % . .4.12 % 
   5.87 %1.73 %0.10 % .
 
--
 Total  80732   -0.16 %   -0.02 % . .0.87 % 
   0.86 %0.04 % . .


>  
> 
>> If theres an easy way to figure out when your code makes it worse and
>> when its an improvement this would be really nice.
> 
> My insentive for this series was, that on r600 the arrays are allocated
>  before the final optimization pass on the byte code that requires that
> the number of registers is <= 124. When I started this no spilling was
> implemented, and shaders with too many arrays and registers would
> simply fail. Now spilling is impelmented, but AFAIK reducing the
> numbers of registers in the final optimization pass does not result in
> changed spilling, so bringing down the number of registers before tgsi-
> to-bytecode is still of interest. 
> 
> For radeonsi my guess would be that the llvm optimizer works better
> when the registers are not yet merged, and that would be the reason why
> register_merge is disabled. 

well at least sometimes it doesn't, low hanging fruit 

Re: [Mesa-dev] [PATCH v3 00/13] TGSI: improved live range tracking, also including arrays

2018-04-29 Thread Gert Wollny
Am Sonntag, den 29.04.2018, 10:43 +0200 schrieb Benedikt Schemmer:
> Hi Gert,
> 
> couldn't resist at least to try what would happen if I enable
> register merge for radeonsi:
> 
>  PERCENTAGE DELTASShaders SGPRs VGPRs SpillSGPR
> SpillVGPR  PrivVGPR   Scratch  CodeSize  MaxWaves Waits
>  piglit 80732   -0.16 %   -0.02
> % . .0.87 %0.86 %0.04 % . .
>  --
> 
>  All affected 513  -17.58 %   -2.30
> % . .4.12 %5.87 %1.73 %0.10 % .
>  --
> 
>  Total  80732   -0.16 %   -0.02
> % . .0.87 %0.86 %0.04 % . .
> 
> I had already removed the defines around the debug code so thats also
> happily outputting data.
> 
> fails with two piglit shaders:
Which are the names of these test? I'd like to check this on r600,
because here I didn't see any regressions last time I checked. 


> Real world is a little different:

I guess these tests refer to enabled register_merge - without and with
this patch set, no? 

Out of curiosity, did you also look at how enabling register_merge
(before this series) impacts the result as compared to the normal
operation of radeonsi? 
 

> If theres an easy way to figure out when your code makes it worse and
> when its an improvement this would be really nice.

My insentive for this series was, that on r600 the arrays are allocated
 before the final optimization pass on the byte code that requires that
the number of registers is <= 124. When I started this no spilling was
implemented, and shaders with too many arrays and registers would
simply fail. Now spilling is impelmented, but AFAIK reducing the
numbers of registers in the final optimization pass does not result in
changed spilling, so bringing down the number of registers before tgsi-
to-bytecode is still of interest. 

For radeonsi my guess would be that the llvm optimizer works better
when the registers are not yet merged, and that would be the reason why
register_merge is disabled. 

In any case, Timothy wrote in this thread [1] (last message) that he
had similar patches for NIR. 

Best, 
Gert 

[1] https://patchwork.freedesktop.org/patch/189842/

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH v3 00/13] TGSI: improved live range tracking, also including arrays

2018-04-29 Thread Benedikt Schemmer
Hi Gert,

couldn't resist at least to try what would happen if I enable register merge 
for radeonsi:

 PERCENTAGE DELTASShaders SGPRs VGPRs SpillSGPR SpillVGPR  PrivVGPR 
  Scratch  CodeSize  MaxWaves Waits
 piglit 80732   -0.16 %   -0.02 % . .0.87 % 
   0.86 %0.04 % . .
 
--
 All affected 513  -17.58 %   -2.30 % . .4.12 % 
   5.87 %1.73 %0.10 % .
 
--
 Total  80732   -0.16 %   -0.02 % . .0.87 % 
   0.86 %0.04 % . .

I had already removed the defines around the debug code so thats also happily 
outputting data.

fails with two piglit shaders:


[require]
GLSL >= 3.30

[fragment shader]
// [config]
// expect_result: pass
// glsl_version: 3.30
// require_extensions: GL_ARB_bindless_texture GL_ARB_shader_image_load_store
// [end config]

#version 330
#extension GL_ARB_bindless_texture: require
#extension GL_ARB_shader_image_load_store: enable
#extension GL_ARB_arrays_of_arrays: enable

struct s {
writeonly image2D img[3][2];
int y;
};

void main()
{
s a[2][4];
imageStore(a[0][0].img[0][0], ivec2(0, 0), vec4(1, 2, 3, 4));
}


and


[require]
GLSL >= 3.30

[fragment shader]
// [config]
// expect_result: pass
// glsl_version: 3.30
// require_extensions: GL_ARB_bindless_texture
// [end config]

#version 330
#extension GL_ARB_bindless_texture: require
#extension GL_ARB_arrays_of_arrays: enable

struct s {
sampler2D tex[3][2];
int y;
};

out vec4 color;

void main()
{
s a[2][4];
color = texture2D(a[0][0].tex[0][0], vec2(0, 0));
}


Real world is a little different:

Max Increase:

SGPRS: 72 -> 96 (33.33 %) (in shaders/cat/1787.shader_test)
VGPRS: 64 -> 84 (31.25 %) (in 
shaders/dirtrally/0859b69789591d7046e211400b1edd9a7cfca734_742.shader_test)
Spilled SGPRs: 0 -> 16 (0.00 %) (in 
shaders/deusex_mankind/d64e2084204e29749639e8fbd9a1e507c7e5e1dd_6840.shader_test)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 24 -> 32 (33.33 %) (in 
shaders/deusex_mankind/28cac87049d8c833e72296a5a02ea6118f1144e5_5876.shader_test)
Scratch size: 28 -> 36 (28.57 %) dwords per thread (in 
shaders/deusex_mankind/28cac87049d8c833e72296a5a02ea6118f1144e5_5876.shader_test)
Code Size: 6988 -> 8036 (15.00 %) bytes (in shaders/cat/1847.shader_test)
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 5 -> 7 (40.00 %) (in 
shaders/ruiner/0967c5fce7fc456496b1cfa25fbb1d1c4dcf9bed_2958.shader_test)
Wait states: 0 -> 0 (0.00 %)

Max Decrease:

SGPRS: 104 -> 64 (-38.46 %) (in 
shaders/deusex_mankind/480ddf21b1076d36f9ffd9911389656b5d8e12cb_2878.shader_test)
VGPRS: 44 -> 36 (-18.18 %) (in 
shaders/ruiner/0967c5fce7fc456496b1cfa25fbb1d1c4dcf9bed_2958.shader_test)
Spilled SGPRs: 19 -> 0 (-100.00 %) (in 
shaders/deusex_mankind/0749c9ae23417f918c7286fe502ff5de4cb8e1a0_3276.shader_test)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 17576 -> 17276 (-1.71 %) bytes (in 
shaders/ruiner/75b96ff36f5328b9ff9366f0d0fd58a1046f51bc_3053.shader_test)
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 8 -> 5 (-37.50 %) (in 
shaders/deusex_mankind/8dabec49e5b6c3b1cbcbaee194eff69f164d72f4_3968.shader_test)
Wait states: 0 -> 0 (0.00 %)


 PERCENTAGE DELTASShaders SGPRs VGPRs SpillSGPR SpillVGPR  PrivVGPR 
  Scratch  CodeSize  MaxWaves Waits
 0ad6 . . . . . 
. . . .
 aer  590 .0.26 %  -20.00 % . . 
.0.34 % . .
 alien_isolation 1414 . . . . . 
. . . .
 anholt10 . . . . . 
. . . .
 bioshock_infinite   2581   -0.02 %0.03 % . . . 
.0.13 % . .
 blackmesa584 . . . . . 
. . . .
 cat  573   -0.06 %   -0.12 % . . . 
.0.20 %0.05 % .
 csgo1392 . .   -0.88 % . . 
.   -0.03 % . .
 deadisland_definitive   17760.06 % . . . . 
.0.15 %0.01 % .
 deadisland_original11602 . . . . . 
.0.05 % . .
 deadisland_riptide_..293   -0.06 %0.06 % . . . 
.0.32 % . .
 deusex_mankind  50510.08 % 

Re: [Mesa-dev] [PATCH v3 00/13] TGSI: improved live range tracking, also including arrays

2018-04-29 Thread Gert Wollny
Hello Benedikt, 

Am Sonntag, den 29.04.2018, 00:06 +0200 schrieb Benedikt Schemmer:
> Hi Gert
> 
> Am 28.04.2018 um 23:51 schrieb Gert Wollny:
> > Am Samstag, den 28.04.2018, 22:43 +0200 schrieb Benedikt Schemmer:
> > > The patches apply cleanly, however I just did a shader-db test
> > > run
> > > and can't find a difference with your patch
> > > applied, am I doing something wrong?
> > 
> > AFAIK radeonsi doesn't use the register-merge optimizer in TGSI.
> > 
> 
> Ah, ok. Was wondering why your debug code doesn't output anything.
> Makes sense now ;)
Not exactly, the reason there is no output is because -DNDEBUG is set.
Without it the statistics should also be printed out on radeonsi, but
thinking of it I should probably disable it when register_merge is not
accessed, because without this the numbers will be inflated and don't
have much meaning.

> So is this useless on radeonsi?
Indeed. 

> Seemed interesting to me.
:) it certainly helps on r600 


> > > compile times went up though:
> > 
> > This is strange, because "see above". Did you compile with debug
> > information and c++11 or higher enables?
...
> >   
> > 
> not intentionally:

Then you should actually not run any code that this series adds to
mesa. I checked again, apart from the debugging output nothing will
ever be run if a drivers that report
PIPE_SHADER_CAP_TGSI_SKIP_MERGE_REGISTERS != 0 (as does radeonsi). 

Best, 
Gert 
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH v3 00/13] TGSI: improved live range tracking, also including arrays

2018-04-28 Thread Benedikt Schemmer
Hi Gert

Am 28.04.2018 um 23:51 schrieb Gert Wollny:
> Am Samstag, den 28.04.2018, 22:43 +0200 schrieb Benedikt Schemmer:
>> The patches apply cleanly, however I just did a shader-db test run
>> and can't find a difference with your patch
>> applied, am I doing something wrong?
> 
> AFAIK radeonsi doesn't use the register-merge optimizer in TGSI.
> 

Ah, ok. Was wondering why your debug code doesn't output anything.
Makes sense now ;)

So is this useless on radeonsi?
Seemed interesting to me.

>>
>> compile times went up though:
> This is strange, because "see above". Did you compile with debug
> information and c++11 or higher enables? In this case there is one
> access to a static variable (first patch)  that per c++11 standard
> should be thread save, which means that there might be a mutex
> protecting access to that variable, and this would explain the longer
> runtime in a multi-threaded environment.  
> 

not intentionally:

   prefix:  /usr/local
exec_prefix: ${prefix}
libdir:  ${prefix}/lib
includedir:  ${prefix}/include

OpenGL:  yes (ES1: yes ES2: yes)

OSMesa:  libOSMesa

DRI platform:drm
DRI drivers: i915 i965 nouveau r200 radeon swrast
DRI driver dir:  ${prefix}/lib/dri
GLX: DRI-based

EGL: yes
EGL drivers: builtin:egl_dri2 builtin:egl_dri3
GBM: yes
EGL/Vulkan/VL platforms:   x11 wayland drm

Vulkan drivers:  intel radeon
Vulkan ICD dir:  ${datarootdir}/vulkan/icd.d

llvm:yes
llvm-config: llvm-config-6.0
llvm-version:6.0.0

Gallium drivers: nouveau svga r600 r300 i915 virgl radeonsi swrast
Gallium st:  mesa xa xvmc vdpau omx_bellagio va nine clover

HUD extra stats: yes
HUD lmsensors:   yes


Shared libs: yes
Static libs: no
Shared-glapi:yes

CFLAGS:  -O3 -fstack-protector-strong -Wall -Wextra 
-Werror=format-security -fno-omit-frame-pointer -Wall 
-Werror=implicit-function-declaration -Werror=missing-prototypes
-Wmissing-prototypes -fno-math-errno -fno-trapping-math -std=c99
CXXFLAGS:-O3 -fstack-protector-strong -Wall -Wextra 
-Werror=format-security -fno-omit-frame-pointer -Wall -fno-math-errno 
-fno-trapping-math
CXX11_CXXFLAGS:
LDFLAGS: -Bsymbolic-functions -z relro
Macros:  -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS 
-D__STDC_LIMIT_MACROS -D_GNU_SOURCE -DUSE_SSE41 -DUSE_GCC_ATOMIC_BUILTINS 
-DNDEBUG -DTEXTURE_FLOAT_ENABLED -DUSE_X86_64_ASM
-DHAVE_SYS_SYSCTL_H -DHAVE_STRTOF -DHAVE_MKOSTEMP -DHAVE_TIMESPEC_GET 
-DHAVE_STRTOD_L -DHAVE_DL_ITERATE_PHDR -DHAVE_POSIX_MEMALIGN -DHAVE_ZLIB 
-DHAVE_LINUX_FUTEX_H -DHAVE_GALLIUM_EXTRA_HUD=1
-DHAVE_LIBSENSORS=1 -DHAVE_LIBDRM -DGLX_USE_DRM -DGLX_INDIRECT_RENDERING 
-DGLX_DIRECT_RENDERING -DGLX_USE_TLS -DHAVE_X11_PLATFORM 
-DHAVE_WAYLAND_PLATFORM -DWL_HIDE_DEPRECATED -DHAVE_DRM_PLATFORM
-DHAVE_DRI3 -DHAVE_DRI3_MODIFIERS -DENABLE_SHADER_CACHE -DHAVE_MINCORE 
-DHAVE_ST_VDPAU -DHAVE_LLVM=0x0600 -DMESA_LLVM_VERSION_PATCH=0

LLVM_CFLAGS: -I/usr/lib/llvm-6.0/include -D__STDC_CONSTANT_MACROS 
-D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
LLVM_CXXFLAGS:   -I/usr/lib/llvm-6.0/include -std=c++0x -std=c++11 
-D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
LLVM_CPPFLAGS:   -I/usr/lib/llvm-6.0/include -D__STDC_CONSTANT_MACROS 
-D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
LLVM_LDFLAGS:-L/usr/lib/llvm-6.0/lib

PYTHON2: python2.7

Run 'make' to build Mesa

Cheers,
Benedikt


> Best, 
> Gert
> 
>>
>> before:
>> Thread 3 took 113.72 seconds and compiled 17899 shaders (not
>> including SIMD16) with 2232 GL context switches
>> Thread 5 took 113.23 seconds and compiled 17767 shaders (not
>> including SIMD16) with 2150 GL context switches
>> Thread 7 took 116.63 seconds and compiled 18030 shaders (not
>> including SIMD16) with 2219 GL context switches
>> Thread 1 took 117.10 seconds and compiled 17966 shaders (not
>> including SIMD16) with 2154 GL context switches
>> Thread 4 took 113.76 seconds and compiled 18097 shaders (not
>> including SIMD16) with 2285 GL context switches
>> Thread 2 took 113.61 seconds and compiled 17111 shaders (not
>> including SIMD16) with 1934 GL context switches
>> Thread 6 took 118.93 seconds and compiled 17887 shaders (not
>> including SIMD16) with 2205 GL context switches
>> Thread 0 took 112.91 seconds and compiled 18232 shaders (not
>> including SIMD16) with 2321 GL context switches
>>
>> with your patch:
>> Thread 1 took 119.41 seconds and compiled 18495 shaders (not
>> including SIMD16) with 2237 GL context switches
>> Thread 7 took 122.11 seconds and compiled 17228 shaders (not
>> including SIMD16) with 2105 GL context switches
>> Thread 4 took 120.57 

Re: [Mesa-dev] [PATCH v3 00/13] TGSI: improved live range tracking, also including arrays

2018-04-28 Thread Gert Wollny
Am Samstag, den 28.04.2018, 22:43 +0200 schrieb Benedikt Schemmer:
> The patches apply cleanly, however I just did a shader-db test run
> and can't find a difference with your patch
> applied, am I doing something wrong?

AFAIK radeonsi doesn't use the register-merge optimizer in TGSI.

> 
> compile times went up though:
This is strange, because "see above". Did you compile with debug
information and c++11 or higher enables? In this case there is one
access to a static variable (first patch)  that per c++11 standard
should be thread save, which means that there might be a mutex
protecting access to that variable, and this would explain the longer
runtime in a multi-threaded environment.  

Best, 
Gert

> 
> before:
> Thread 3 took 113.72 seconds and compiled 17899 shaders (not
> including SIMD16) with 2232 GL context switches
> Thread 5 took 113.23 seconds and compiled 17767 shaders (not
> including SIMD16) with 2150 GL context switches
> Thread 7 took 116.63 seconds and compiled 18030 shaders (not
> including SIMD16) with 2219 GL context switches
> Thread 1 took 117.10 seconds and compiled 17966 shaders (not
> including SIMD16) with 2154 GL context switches
> Thread 4 took 113.76 seconds and compiled 18097 shaders (not
> including SIMD16) with 2285 GL context switches
> Thread 2 took 113.61 seconds and compiled 17111 shaders (not
> including SIMD16) with 1934 GL context switches
> Thread 6 took 118.93 seconds and compiled 17887 shaders (not
> including SIMD16) with 2205 GL context switches
> Thread 0 took 112.91 seconds and compiled 18232 shaders (not
> including SIMD16) with 2321 GL context switches
> 
> with your patch:
> Thread 1 took 119.41 seconds and compiled 18495 shaders (not
> including SIMD16) with 2237 GL context switches
> Thread 7 took 122.11 seconds and compiled 17228 shaders (not
> including SIMD16) with 2105 GL context switches
> Thread 4 took 120.57 seconds and compiled 17989 shaders (not
> including SIMD16) with 2165 GL context switches
> Thread 5 took 119.79 seconds and compiled 17709 shaders (not
> including SIMD16) with 2190 GL context switches
> Thread 6 took 121.95 seconds and compiled 17804 shaders (not
> including SIMD16) with 2209 GL context switches
> Thread 2 took 121.43 seconds and compiled 17819 shaders (not
> including SIMD16) with 2167 GL context switches
> Thread 0 took 117.16 seconds and compiled 18180 shaders (not
> including SIMD16) with 2151 GL context switches
> Thread 3 took 124.79 seconds and compiled 17765 shaders (not
> including SIMD16) with 2176 GL context switches
> 
> 
> 
> Radeon RX 560 Series (POLARIS11, DRM 3.26.0, 4.17.0-rc2+, LLVM 6.0.0)
> 
> Mesa git of today
> 
> PERCENTAGE DELTASShaders SGPRs VGPRs SpillSGPR
> SpillVGPR  PrivVGPR   Scratch  CodeSize  MaxWaves Waits
>  0ad6 . . . .
>  . . . . .
>  aer  590 . . . .
>  . . . . .
>  alien_isolation 1414 . . . .
>  . . . . .
>  anholt10 . . . .
>  . . . . .
>  bioshock_infinite   2581 . . . .
>  . . . . .
>  blackmesa584 . . . .
>  . . . . .
>  cat  573 . . . .
>  . . . . .
>  csgo1392 . . . .
>  . . . . .
>  deadisland_definitive   1776 . . . .
>  . . . . .
>  deadisland_original11602 . . . .
>  . . . . .
>  deadisland_riptide_..293 . . . .
>  . . . . .
>  deusex_mankind  5051 . . . .
>  . . . . .
>  dirtrally787 . . . .
>  . . . . .
>  dolphin   22 . . . .
>  . . . . .
>  dyinglight  4012 . . . .
>  . . . . .
>  eurotruck2   216 . . . .
>  . . . . .
>  f1_2015  746 . . . .
>  . . . . .
>  glamor16 . . . .
>  . . . . .
>  hl2ep1   294 

Re: [Mesa-dev] [PATCH v3 00/13] TGSI: improved live range tracking, also including arrays

2018-04-28 Thread Benedikt Schemmer
The patches apply cleanly, however I just did a shader-db test run and can't 
find a difference with your patch
applied, am I doing something wrong?

compile times went up though:

before:
Thread 3 took 113.72 seconds and compiled 17899 shaders (not including SIMD16) 
with 2232 GL context switches
Thread 5 took 113.23 seconds and compiled 17767 shaders (not including SIMD16) 
with 2150 GL context switches
Thread 7 took 116.63 seconds and compiled 18030 shaders (not including SIMD16) 
with 2219 GL context switches
Thread 1 took 117.10 seconds and compiled 17966 shaders (not including SIMD16) 
with 2154 GL context switches
Thread 4 took 113.76 seconds and compiled 18097 shaders (not including SIMD16) 
with 2285 GL context switches
Thread 2 took 113.61 seconds and compiled 17111 shaders (not including SIMD16) 
with 1934 GL context switches
Thread 6 took 118.93 seconds and compiled 17887 shaders (not including SIMD16) 
with 2205 GL context switches
Thread 0 took 112.91 seconds and compiled 18232 shaders (not including SIMD16) 
with 2321 GL context switches

with your patch:
Thread 1 took 119.41 seconds and compiled 18495 shaders (not including SIMD16) 
with 2237 GL context switches
Thread 7 took 122.11 seconds and compiled 17228 shaders (not including SIMD16) 
with 2105 GL context switches
Thread 4 took 120.57 seconds and compiled 17989 shaders (not including SIMD16) 
with 2165 GL context switches
Thread 5 took 119.79 seconds and compiled 17709 shaders (not including SIMD16) 
with 2190 GL context switches
Thread 6 took 121.95 seconds and compiled 17804 shaders (not including SIMD16) 
with 2209 GL context switches
Thread 2 took 121.43 seconds and compiled 17819 shaders (not including SIMD16) 
with 2167 GL context switches
Thread 0 took 117.16 seconds and compiled 18180 shaders (not including SIMD16) 
with 2151 GL context switches
Thread 3 took 124.79 seconds and compiled 17765 shaders (not including SIMD16) 
with 2176 GL context switches



Radeon RX 560 Series (POLARIS11, DRM 3.26.0, 4.17.0-rc2+, LLVM 6.0.0)

Mesa git of today

PERCENTAGE DELTASShaders SGPRs VGPRs SpillSGPR SpillVGPR  PrivVGPR  
 Scratch  CodeSize  MaxWaves Waits
 0ad6 . . . . . 
. . . .
 aer  590 . . . . . 
. . . .
 alien_isolation 1414 . . . . . 
. . . .
 anholt10 . . . . . 
. . . .
 bioshock_infinite   2581 . . . . . 
. . . .
 blackmesa584 . . . . . 
. . . .
 cat  573 . . . . . 
. . . .
 csgo1392 . . . . . 
. . . .
 deadisland_definitive   1776 . . . . . 
. . . .
 deadisland_original11602 . . . . . 
. . . .
 deadisland_riptide_..293 . . . . . 
. . . .
 deusex_mankind  5051 . . . . . 
. . . .
 dirtrally787 . . . . . 
. . . .
 dolphin   22 . . . . . 
. . . .
 dyinglight  4012 . . . . . 
. . . .
 eurotruck2   216 . . . . . 
. . . .
 f1_2015  746 . . . . . 
. . . .
 glamor16 . . . . . 
. . . .
 hl2ep1   294 . . . . . 
. . . .
 hl2ep2   154 . . . . . 
. . . .
 hl2lostcoast  66 . . . . . 
. . . .
 hlsl3582 . . . . . 
. . . .
 humus-celshading   4 . . . . . 
. . . .
 humus-domino   6 . . . . . 
. . . .