Re: [Mesa-dev] [PATCH 1/2] nir: add pass to move load_const

2018-06-06 Thread Rob Clark
On Wed, Jun 6, 2018 at 8:34 PM, Ian Romanick  wrote:
> On 06/06/2018 04:47 PM, Rob Clark wrote:
>> On Wed, Jun 6, 2018 at 6:51 PM, Ian Romanick  wrote:
>>> On 06/06/2018 07:43 AM, Rob Clark wrote:
>>>> Run this pass late (after opt loop) to move load_const instructions back
>>>> into the basic blocks which use the result, in cases where a load_const
>>>> is only consumed in a single block.
>>>
>>> If the load_const is used in more than one block, you could move it to
>>> the block that dominates all of the blocks that contain a use.  I suspect
>>> in many cases that will just be the first block, but it's probably worth
>>> trying.
>>
>> hmm, good point.. I was mostly going for the super-obvious wins with
>> no downside cases (i.e. low hanging fruit), but I guess the block that
>> dominates all the uses isn't going to be worse than the first block..
>>
>> I can play around w/ the common denominator case.. although in at
>> least some cases I think it might be better to split the load_const.
>> Idk, the low hanging fruit approach cuts register usage by 1/3rd in
>> some of these edge case shaders, which by itself seems like a big
>> enough win, but I guess nir_shader_compiler_options config params is
>> an option for more aggressive approaches that may or may not be a win
>> for different drivers.
>
> True.  We could land this and incrementally improve it later.  If we
> want to do that, I'd suggest
>
>  - Put all of our ideas for improving the pass, including the one
>    already documented in the code, in the file header comment.
>
>  - Remove the suggestions for future work from the "if consumed in more
>    than one block, ignore." comment.

sgtm, I'll respin tomorrow.. maybe after playing around a bit w/ the
common denominator block thing (since I see some cases that seem like
they would benefit)

>
>>>> This helps reduce register usage in cases where the backend driver
>>>> cannot lower the load_const to a uniform.
>>>
>>> The trade-off is that now you might not be able to hide the latency of
>>> the load.  I tried this on i965, and the results (below) support this
>>> idea a bit.  If you have a structured backend IR, this seems like a
>>> thing better done there... like, if you can't register allocate without
>>> spilling, try moving the load.  But that's really just a scheduling
>>> heuristic.  Hmm...
>>
>> iirc, i965 had some clever load_const lowering passes so all the threads
>> in a warp could share the same load_const (iirc?  I might not have
>> that description quite right)..  so I wasn't expecting this to be
>> useful to i965 (but if it is, then, bonus!).  For me I can *almost*
>> always lower load_const to uniform, so in the common cases it hasn't
>> been a problem.  The exceptions are cases where an alu instruction
>> consumes >1 const (but those get optimized away) or a const plus
>> uniform, or things like ssbo/image store.  (I noticed the register
>> usage blowing up thing looking at some deqp tests like
>> dEQP-GLES31.functional.ssbo.layout.2_level_array.shared.vec4 ;-))
>
> There are some optimizations for loading constants in the vec4 backend.
> There is some code to lower arrays of constants to uniforms.  Uniforms
> are already shared within the SIMD group.  I don't know of anything
> other than that, but that doesn't mean such a thing doesn't exist.
>
> Also... at what point do you call this pass?  Here's what I did for
> i965... basically called it as late as I thought was possible:
>
> diff --git a/src/intel/compiler/brw_nir.c b/src/intel/compiler/brw_nir.c
> index dfeea73b06a..2209d6142b3 100644
> --- a/src/intel/compiler/brw_nir.c
> +++ b/src/intel/compiler/brw_nir.c
> @@ -795,6 +795,8 @@ brw_postprocess_nir(nir_shader *nir, const struct brw_compiler *compiler,
> if (devinfo->gen <= 5)
>brw_nir_analyze_boolean_resolves(nir);
>
> +   nir_move_load_const(nir);
> +

fwiw, this is basically where I called it for ir3; it is intended to be
after the opt loop (or really after anything that might move load_const
back to the top of the first block)..

but tbh I didn't play with it that much yet, other than seeing that it
was a big win for the pedantic cases (like the deqp test I
mentioned).  I think about the only thing I do after the
nir_move_load_const() pass is nir_lower_locals_to_regs +
nir_convert_from_ssa(phi_webs_only=true)
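
i.e. roughly (a hypothetical sketch of that ordering, not the actual
ir3 code):

   /* after the optimization loop, sink the load_consts: */
   nir_move_load_const(nir);

   /* ... and then leave SSA: */
   nir_lower_locals_to_regs(nir);
   nir_convert_from_ssa(nir, true /* phi_webs_only */);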


> nir_sweep(nir);
>
> if (unlikely(debug_enabled)) {
>
> Now that the full shader-db run is done, it seems to have some odd
> effects on vec4 shaders, so I'll probably tweak this a bit.
>
>> If latency of the load is an issue, perhaps a configuration option
>> for the minimum size of the destination use_block, with the rough theory
>> that if the block is bigger than some threshold the instruction
>> scheduler could probably hide the latency?
>
> I thought about this a bit more... I think the driver's scheduler should
> fix this case.  If it can't hide the latency of a load immediate, the
> scheduler should move it to the earlier block.  That might even help
> hide the latency of a compare.

Re: [Mesa-dev] [PATCH 1/2] nir: add pass to move load_const

2018-06-06 Thread Ian Romanick
On 06/06/2018 04:47 PM, Rob Clark wrote:
> On Wed, Jun 6, 2018 at 6:51 PM, Ian Romanick  wrote:
>> On 06/06/2018 07:43 AM, Rob Clark wrote:
>>> Run this pass late (after opt loop) to move load_const instructions back
>>> into the basic blocks which use the result, in cases where a load_const
>>> is only consumed in a single block.
>>
>> If the load_const is used in more than one block, you could move it to
>> the block that dominates all of the blocks that contain a use.  I suspect
>> in many cases that will just be the first block, but it's probably worth
>> trying.
> 
> hmm, good point.. I was mostly going for the super-obvious wins with
> no downside cases (i.e. low hanging fruit), but I guess the block that
> dominates all the uses isn't going to be worse than the first block..
> 
> I can play around w/ the common denominator case.. although in at
> least some cases I think it might be better to split the load_const.
> Idk, the low hanging fruit approach cuts register usage by 1/3rd in
> some of these edge case shaders, which by itself seems like a big
> enough win, but I guess nir_shader_compiler_options config params is
> an option for more aggressive approaches that may or may not be a win
> for different drivers.

True.  We could land this and incrementally improve it later.  If we
want to do that, I'd suggest

 - Put all of our ideas for improving the pass, including the one
   already documented in the code, in the file header comment.

 - Remove the suggestions for future work from the "if consumed in more
   than one block, ignore." comment.

>>> This helps reduce register usage in cases where the backend driver
>>> cannot lower the load_const to a uniform.
>>
>> The trade-off is that now you might not be able to hide the latency of
>> the load.  I tried this on i965, and the results (below) support this
>> idea a bit.  If you have a structured backend IR, this seems like a
>> thing better done there... like, if you can't register allocate without
>> spilling, try moving the load.  But that's really just a scheduling
>> heuristic.  Hmm...
> 
> iirc, i965 had some clever load_const lowering passes so all the threads
> in a warp could share the same load_const (iirc?  I might not have
> that description quite right)..  so I wasn't expecting this to be
> useful to i965 (but if it is, then, bonus!).  For me I can *almost*
> always lower load_const to uniform, so in the common cases it hasn't
> been a problem.  The exceptions are cases where an alu instruction
> consumes >1 const (but those get optimized away) or a const plus
> uniform, or things like ssbo/image store.  (I noticed the register
> usage blowing up thing looking at some deqp tests like
> dEQP-GLES31.functional.ssbo.layout.2_level_array.shared.vec4 ;-))

There are some optimizations for loading constants in the vec4 backend.
There is some code to lower arrays of constants to uniforms.  Uniforms
are already shared within the SIMD group.  I don't know of anything
other than that, but that doesn't mean such a thing doesn't exist.

Also... at what point do you call this pass?  Here's what I did for
i965... basically called it as late as I thought was possible:

diff --git a/src/intel/compiler/brw_nir.c b/src/intel/compiler/brw_nir.c
index dfeea73b06a..2209d6142b3 100644
--- a/src/intel/compiler/brw_nir.c
+++ b/src/intel/compiler/brw_nir.c
@@ -795,6 +795,8 @@ brw_postprocess_nir(nir_shader *nir, const struct brw_compiler *compiler,
if (devinfo->gen <= 5)
   brw_nir_analyze_boolean_resolves(nir);

+   nir_move_load_const(nir);
+
nir_sweep(nir);

if (unlikely(debug_enabled)) {

Now that the full shader-db run is done, it seems to have some odd
effects on vec4 shaders, so I'll probably tweak this a bit.

> If latency of the load is an issue, perhaps a configuration option
> for the minimum size of the destination use_block, with the rough theory
> that if the block is bigger than some threshold the instruction
> scheduler could probably hide the latency?

I thought about this a bit more... I think the driver's scheduler should
fix this case.  If it can't hide the latency of a load immediate, the
scheduler should move it to the earlier block.  That might even help
hide the latency of a compare.  Something like

cmp.g.f0(16)    null    g2<8,8,1>F    -g9.4<0,1,0>F
(+f0) if(16)    JIP: 40    UIP: 40
   END B0 ->B1 ->B2
   START B1 <-B0 (40 cycles)
mov(16)         g77<1>F    1F

should become

cmp.g.f0(16)    null    g2<8,8,1>F    -g9.4<0,1,0>F
mov(16)         g77<1>F    1F
(+f0) if(16)    JIP: 40    UIP: 40
   END B0 ->B1 ->B2
   START B1 <-B0 (40 cycles)

> BR,
> -R
> 
> 
>> total instructions in shared programs: 14398410 -> 14397918 (<.01%)
>> instructions in affected programs: 134856 -> 134364 (-0.36%)
>> helped: 53
>> HURT: 10
>> helped stats (abs) min: 1 max: 30 x̄: 9.47 x̃: 7
>> helped stats (rel) min: 0.16% max: 2.74% x̄: 0.64% x̃: 0.40%
>> HURT stats (abs)   min: 1 max: 1 x̄: 1.00 x̃: 1
>> 

Re: [Mesa-dev] [PATCH 1/2] nir: add pass to move load_const

2018-06-06 Thread Rob Clark
On Wed, Jun 6, 2018 at 6:51 PM, Ian Romanick  wrote:
> On 06/06/2018 07:43 AM, Rob Clark wrote:
>> Run this pass late (after opt loop) to move load_const instructions back
>> into the basic blocks which use the result, in cases where a load_const
>> is only consumed in a single block.
>
> If the load_const is used in more than one block, you could move it to
> the block that dominates all of the blocks that contain a use.  I suspect
> in many cases that will just be the first block, but it's probably worth
> trying.

hmm, good point.. I was mostly going for the super-obvious wins with
no downside cases (i.e. low hanging fruit), but I guess the block that
dominates all the uses isn't going to be worse than the first block..

I can play around w/ the common denominator case.. although in at
least some cases I think it might be better to split the load_const.
Idk, the low hanging fruit approach cuts register usage by 1/3rd in
some of these edge case shaders, which by itself seems like a big
enough win, but I guess nir_shader_compiler_options config params is
an option for more aggressive approaches that may or may not be a win
for different drivers.

>
>> This helps reduce register usage in cases where the backend driver
>> cannot lower the load_const to a uniform.
>
> The trade off is that now you might not be able to hide the latency of
> the load.  I tried this on i965, and the results (below) support this
> idea a bit.  If you have a structured backend IR, this seems like a
> thing better done there... like, if you can't register allocate without
> spilling, try moving the load.  But that's really just a scheduling
> heuristic.  Hmm...
>

iirc, i965 had some clever load_const lowering passes so all the threads
in a warp could share the same load_const (iirc?  I might not have
that description quite right)..  so I wasn't expecting this to be
useful to i965 (but if it is, then, bonus!).  For me I can *almost*
always lower load_const to uniform, so in the common cases it hasn't
been a problem.  The exceptions are cases where an alu instruction
consumes >1 const (but those get optimized away) or a const plus
uniform, or things like ssbo/image store.  (I noticed the register
usage blowing up thing looking at some deqp tests like
dEQP-GLES31.functional.ssbo.layout.2_level_array.shared.vec4 ;-))

If latency of the load is an issue, perhaps a configuration option
for the minimum size of the destination use_block, with the rough theory
that if the block is bigger than some threshold the instruction
scheduler could probably hide the latency?
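
(A sketch of what that check could look like; block_is_large_enough()
and the threshold are hypothetical, not something the patch implements:)

/* Hypothetical heuristic: only sink a load_const into its use block if
 * that block has at least min_instrs instructions, on the theory that
 * a big enough block gives the backend scheduler room to hide the load
 * latency.  The threshold would presumably come from
 * nir_shader_compiler_options.
 */
static bool
block_is_large_enough(nir_block *block, unsigned min_instrs)
{
   unsigned count = 0;

   nir_foreach_instr(instr, block) {
      if (++count >= min_instrs)
         return true;
   }

   return false;
}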

BR,
-R


> total instructions in shared programs: 14398410 -> 14397918 (<.01%)
> instructions in affected programs: 134856 -> 134364 (-0.36%)
> helped: 53
> HURT: 10
> helped stats (abs) min: 1 max: 30 x̄: 9.47 x̃: 7
> helped stats (rel) min: 0.16% max: 2.74% x̄: 0.64% x̃: 0.40%
> HURT stats (abs)   min: 1 max: 1 x̄: 1.00 x̃: 1
> HURT stats (rel)   min: 0.03% max: 0.21% x̄: 0.05% x̃: 0.03%
> 95% mean confidence interval for instructions value: -9.61 -6.01
> 95% mean confidence interval for instructions %-change: -0.68% -0.38%
> Instructions are helped.
>
> total cycles in shared programs: 532949823 -> 532851352 (-0.02%)
> cycles in affected programs: 260693023 -> 260594552 (-0.04%)
> helped: 122
> HURT: 88
> helped stats (abs) min: 1 max: 24933 x̄: 1080.30 x̃: 20
> helped stats (rel) min: <.01% max: 34.94% x̄: 2.20% x̃: 0.32%
> HURT stats (abs)   min: 1 max: 1884 x̄: 378.69 x̃: 29
> HURT stats (rel)   min: <.01% max: 10.23% x̄: 0.64% x̃: 0.09%
> 95% mean confidence interval for cycles value: -943.71 5.89
> 95% mean confidence interval for cycles %-change: -1.71% -0.32%
> Inconclusive result (value mean confidence interval includes 0).
>
> total spills in shared programs: 8044 -> 7975 (-0.86%)
> spills in affected programs: 2391 -> 2322 (-2.89%)
> helped: 42
> HURT: 0
>
> total fills in shared programs: 10950 -> 10884 (-0.60%)
> fills in affected programs: 2999 -> 2933 (-2.20%)
> helped: 42
> HURT: 0
>
> LOST:   4
> GAINED: 7
>
>> Signed-off-by: Rob Clark 
>> ---
>>  src/compiler/Makefile.sources  |   1 +
>>  src/compiler/nir/meson.build   |   1 +
>>  src/compiler/nir/nir.h |   1 +
>>  src/compiler/nir/nir_move_load_const.c | 131 +
>>  4 files changed, 134 insertions(+)
>>  create mode 100644 src/compiler/nir/nir_move_load_const.c
>>
>> diff --git a/src/compiler/Makefile.sources b/src/compiler/Makefile.sources
>> index 3daa2c51334..cb8a2875a84 100644
>> --- a/src/compiler/Makefile.sources
>> +++ b/src/compiler/Makefile.sources
>> @@ -253,6 +253,7 @@ NIR_FILES = \
>>   nir/nir_lower_wpos_center.c \
>>   nir/nir_lower_wpos_ytransform.c \
>>   nir/nir_metadata.c \
>> + nir/nir_move_load_const.c \
>>   nir/nir_move_vec_src_uses_to_dest.c \
>>   nir/nir_normalize_cubemap_coords.c \
>>   nir/nir_opt_conditional_discard.c \
>> diff --git a/src/compiler/nir/meson.build b/src/compiler/nir/meson.build
>> index 3fec363691d..373a47fd155 100644

Re: [Mesa-dev] [PATCH 1/2] nir: add pass to move load_const

2018-06-06 Thread Ian Romanick
On 06/06/2018 03:51 PM, Ian Romanick wrote:
> On 06/06/2018 07:43 AM, Rob Clark wrote:
>> Run this pass late (after opt loop) to move load_const instructions back
>> into the basic blocks which use the result, in cases where a load_const
>> is only consumed in a single block.
> 
> If the load_const is used in more than one block, you could move it to
> the block that dominates all of the blocks that contain a use.  I suspect
> in many cases that will just be the first block, but it's probably worth
> trying.
> 
>> This helps reduce register usage in cases where the backend driver
>> cannot lower the load_const to a uniform.
> 
> The trade-off is that now you might not be able to hide the latency of
> the load.  I tried this on i965, and the results (below) support this
> idea a bit.  If you have a structured backend IR, this seems like a
> thing better done there... like, if you can't register allocate without
> spilling, try moving the load.  But that's really just a scheduling
> heuristic.  Hmm...
> 
> total instructions in shared programs: 14398410 -> 14397918 (<.01%)
> instructions in affected programs: 134856 -> 134364 (-0.36%)
> helped: 53
> HURT: 10
> helped stats (abs) min: 1 max: 30 x̄: 9.47 x̃: 7
> helped stats (rel) min: 0.16% max: 2.74% x̄: 0.64% x̃: 0.40%
> HURT stats (abs)   min: 1 max: 1 x̄: 1.00 x̃: 1
> HURT stats (rel)   min: 0.03% max: 0.21% x̄: 0.05% x̃: 0.03%
> 95% mean confidence interval for instructions value: -9.61 -6.01
> 95% mean confidence interval for instructions %-change: -0.68% -0.38%
> Instructions are helped.
> 
> total cycles in shared programs: 532949823 -> 532851352 (-0.02%)
> cycles in affected programs: 260693023 -> 260594552 (-0.04%)
> helped: 122
> HURT: 88
> helped stats (abs) min: 1 max: 24933 x̄: 1080.30 x̃: 20
> helped stats (rel) min: <.01% max: 34.94% x̄: 2.20% x̃: 0.32%
> HURT stats (abs)   min: 1 max: 1884 x̄: 378.69 x̃: 29
> HURT stats (rel)   min: <.01% max: 10.23% x̄: 0.64% x̃: 0.09%
> 95% mean confidence interval for cycles value: -943.71 5.89
> 95% mean confidence interval for cycles %-change: -1.71% -0.32%
> Inconclusive result (value mean confidence interval includes 0).
> 
> total spills in shared programs: 8044 -> 7975 (-0.86%)
> spills in affected programs: 2391 -> 2322 (-2.89%)
> helped: 42
> HURT: 0
> 
> total fills in shared programs: 10950 -> 10884 (-0.60%)
> fills in affected programs: 2999 -> 2933 (-2.20%)
> helped: 42
> HURT: 0
> 
> LOST:   4
> GAINED: 7

This lost result is *really* misleading.  The 4 lost shaders are compute
shaders that went from SIMD8 to SIMD16.


Re: [Mesa-dev] [PATCH 1/2] nir: add pass to move load_const

2018-06-06 Thread Ian Romanick
On 06/06/2018 07:43 AM, Rob Clark wrote:
> Run this pass late (after opt loop) to move load_const instructions back
> into the basic blocks which use the result, in cases where a load_const
> is only consumed in a single block.

If the load_const is used in more than one block, you could move it to
the block that dominates all of the blocks that contain a use.  I suspect
in many cases that will just be the first block, but it's probably worth
trying.
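
(For illustration, a sketch of how that could work on top of NIR's
dominance metadata; get_dominating_block() is a hypothetical helper,
and it assumes nir_metadata_require(impl, nir_metadata_dominance) has
been called first:)

/* Hypothetical sketch: find a block that dominates every use of a def.
 * nir_dominance_lca() returns the least common ancestor of two blocks
 * in the dominance tree, so folding it over all of the use blocks
 * yields a block that dominates all of them.
 */
static nir_block *
get_dominating_block(nir_ssa_def *def)
{
   nir_block *lca = NULL;

   /* As in the patch, ignore if-uses and phi-uses for simplicity. */
   if (!list_empty(&def->if_uses))
      return NULL;

   nir_foreach_use(use, def) {
      if (use->parent_instr->type == nir_instr_type_phi)
         return NULL;

      nir_block *use_block = use->parent_instr->block;
      lca = lca ? nir_dominance_lca(lca, use_block) : use_block;
   }

   return lca;
}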

> This helps reduce register usage in cases where the backend driver
> cannot lower the load_const to a uniform.

The trade-off is that now you might not be able to hide the latency of
the load.  I tried this on i965, and the results (below) support this
idea a bit.  If you have a structured backend IR, this seems like a
thing better done there... like, if you can't register allocate without
spilling, try moving the load.  But that's really just a scheduling
heuristic.  Hmm...

total instructions in shared programs: 14398410 -> 14397918 (<.01%)
instructions in affected programs: 134856 -> 134364 (-0.36%)
helped: 53
HURT: 10
helped stats (abs) min: 1 max: 30 x̄: 9.47 x̃: 7
helped stats (rel) min: 0.16% max: 2.74% x̄: 0.64% x̃: 0.40%
HURT stats (abs)   min: 1 max: 1 x̄: 1.00 x̃: 1
HURT stats (rel)   min: 0.03% max: 0.21% x̄: 0.05% x̃: 0.03%
95% mean confidence interval for instructions value: -9.61 -6.01
95% mean confidence interval for instructions %-change: -0.68% -0.38%
Instructions are helped.

total cycles in shared programs: 532949823 -> 532851352 (-0.02%)
cycles in affected programs: 260693023 -> 260594552 (-0.04%)
helped: 122
HURT: 88
helped stats (abs) min: 1 max: 24933 x̄: 1080.30 x̃: 20
helped stats (rel) min: <.01% max: 34.94% x̄: 2.20% x̃: 0.32%
HURT stats (abs)   min: 1 max: 1884 x̄: 378.69 x̃: 29
HURT stats (rel)   min: <.01% max: 10.23% x̄: 0.64% x̃: 0.09%
95% mean confidence interval for cycles value: -943.71 5.89
95% mean confidence interval for cycles %-change: -1.71% -0.32%
Inconclusive result (value mean confidence interval includes 0).

total spills in shared programs: 8044 -> 7975 (-0.86%)
spills in affected programs: 2391 -> 2322 (-2.89%)
helped: 42
HURT: 0

total fills in shared programs: 10950 -> 10884 (-0.60%)
fills in affected programs: 2999 -> 2933 (-2.20%)
helped: 42
HURT: 0

LOST:   4
GAINED: 7

> Signed-off-by: Rob Clark 
> ---
>  src/compiler/Makefile.sources  |   1 +
>  src/compiler/nir/meson.build   |   1 +
>  src/compiler/nir/nir.h |   1 +
>  src/compiler/nir/nir_move_load_const.c | 131 +
>  4 files changed, 134 insertions(+)
>  create mode 100644 src/compiler/nir/nir_move_load_const.c
> 
> diff --git a/src/compiler/Makefile.sources b/src/compiler/Makefile.sources
> index 3daa2c51334..cb8a2875a84 100644
> --- a/src/compiler/Makefile.sources
> +++ b/src/compiler/Makefile.sources
> @@ -253,6 +253,7 @@ NIR_FILES = \
>   nir/nir_lower_wpos_center.c \
>   nir/nir_lower_wpos_ytransform.c \
>   nir/nir_metadata.c \
> + nir/nir_move_load_const.c \
>   nir/nir_move_vec_src_uses_to_dest.c \
>   nir/nir_normalize_cubemap_coords.c \
>   nir/nir_opt_conditional_discard.c \
> diff --git a/src/compiler/nir/meson.build b/src/compiler/nir/meson.build
> index 3fec363691d..373a47fd155 100644
> --- a/src/compiler/nir/meson.build
> +++ b/src/compiler/nir/meson.build
> @@ -144,6 +144,7 @@ files_libnir = files(
>'nir_lower_wpos_ytransform.c',
>'nir_lower_bit_size.c',
>'nir_metadata.c',
> +  'nir_move_load_const.c',
>'nir_move_vec_src_uses_to_dest.c',
>'nir_normalize_cubemap_coords.c',
>'nir_opt_conditional_discard.c',
> diff --git a/src/compiler/nir/nir.h b/src/compiler/nir/nir.h
> index 5a1f79515ad..073ab4e82ea 100644
> --- a/src/compiler/nir/nir.h
> +++ b/src/compiler/nir/nir.h
> @@ -2612,6 +2612,7 @@ bool nir_remove_dead_variables(nir_shader *shader, 
> nir_variable_mode modes);
>  bool nir_lower_constant_initializers(nir_shader *shader,
>   nir_variable_mode modes);
>  
> +bool nir_move_load_const(nir_shader *shader);
>  bool nir_move_vec_src_uses_to_dest(nir_shader *shader);
>  bool nir_lower_vec_to_movs(nir_shader *shader);
>  void nir_lower_alpha_test(nir_shader *shader, enum compare_func func,
>> diff --git a/src/compiler/nir/nir_move_load_const.c b/src/compiler/nir/nir_move_load_const.c
> new file mode 100644
> index 000..c0750035fbb
> --- /dev/null
> +++ b/src/compiler/nir/nir_move_load_const.c
> @@ -0,0 +1,131 @@
> +/*
> + * Copyright © 2018 Red Hat
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
>> + * Software is furnished to do so, subject to the following conditions:

[Mesa-dev] [PATCH 1/2] nir: add pass to move load_const

2018-06-06 Thread Rob Clark
Run this pass late (after opt loop) to move load_const instructions back
into the basic blocks which use the result, in cases where a load_const
is only consumed in a single block.

This helps reduce register usage in cases where the backend driver
cannot lower the load_const to a uniform.
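
For example (with made-up SSA names), a shader like

   block block_0:
      vec1 32 ssa_1 = load_const (0x3f800000 /* 1.000000 */)
      ...
   block block_1:
      vec1 32 ssa_7 = fadd ssa_6, ssa_1

becomes

   block block_0:
      ...
   block block_1:
      vec1 32 ssa_1 = load_const (0x3f800000 /* 1.000000 */)
      vec1 32 ssa_7 = fadd ssa_6, ssa_1

so ssa_1 no longer has to occupy a register across the whole shader.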

Signed-off-by: Rob Clark 
---
 src/compiler/Makefile.sources  |   1 +
 src/compiler/nir/meson.build   |   1 +
 src/compiler/nir/nir.h |   1 +
 src/compiler/nir/nir_move_load_const.c | 131 +
 4 files changed, 134 insertions(+)
 create mode 100644 src/compiler/nir/nir_move_load_const.c

diff --git a/src/compiler/Makefile.sources b/src/compiler/Makefile.sources
index 3daa2c51334..cb8a2875a84 100644
--- a/src/compiler/Makefile.sources
+++ b/src/compiler/Makefile.sources
@@ -253,6 +253,7 @@ NIR_FILES = \
nir/nir_lower_wpos_center.c \
nir/nir_lower_wpos_ytransform.c \
nir/nir_metadata.c \
+   nir/nir_move_load_const.c \
nir/nir_move_vec_src_uses_to_dest.c \
nir/nir_normalize_cubemap_coords.c \
nir/nir_opt_conditional_discard.c \
diff --git a/src/compiler/nir/meson.build b/src/compiler/nir/meson.build
index 3fec363691d..373a47fd155 100644
--- a/src/compiler/nir/meson.build
+++ b/src/compiler/nir/meson.build
@@ -144,6 +144,7 @@ files_libnir = files(
   'nir_lower_wpos_ytransform.c',
   'nir_lower_bit_size.c',
   'nir_metadata.c',
+  'nir_move_load_const.c',
   'nir_move_vec_src_uses_to_dest.c',
   'nir_normalize_cubemap_coords.c',
   'nir_opt_conditional_discard.c',
diff --git a/src/compiler/nir/nir.h b/src/compiler/nir/nir.h
index 5a1f79515ad..073ab4e82ea 100644
--- a/src/compiler/nir/nir.h
+++ b/src/compiler/nir/nir.h
@@ -2612,6 +2612,7 @@ bool nir_remove_dead_variables(nir_shader *shader, 
nir_variable_mode modes);
 bool nir_lower_constant_initializers(nir_shader *shader,
  nir_variable_mode modes);
 
+bool nir_move_load_const(nir_shader *shader);
 bool nir_move_vec_src_uses_to_dest(nir_shader *shader);
 bool nir_lower_vec_to_movs(nir_shader *shader);
 void nir_lower_alpha_test(nir_shader *shader, enum compare_func func,
diff --git a/src/compiler/nir/nir_move_load_const.c b/src/compiler/nir/nir_move_load_const.c
new file mode 100644
index 000..c0750035fbb
--- /dev/null
+++ b/src/compiler/nir/nir_move_load_const.c
@@ -0,0 +1,131 @@
+/*
+ * Copyright © 2018 Red Hat
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ * Authors:
+ *    Rob Clark <robdcl...@gmail.com>
+ *
+ */
+
+#include "nir.h"
+
+
+/*
+ * A simple pass that moves load_const instructions into the consuming
+ * block if they are only consumed in a single block, to try to
+ * counteract nir's tendency to move all load_consts to the top of the
+ * first block.
+ */
+
+/* Iterate an ssa def's uses and, if all uses are within a single
+ * block, return it; otherwise return NULL.
+ */
+static nir_block *
+get_single_block(nir_ssa_def *def)
+{
+   nir_block *b = NULL;
+
+   /* hmm, probably ignore if-uses: */
+   if (!list_empty(&def->if_uses))
+      return NULL;
+
+   nir_foreach_use(use, def) {
+      nir_instr *instr = use->parent_instr;
+
+      /*
+       * Kind of an ugly special-case, but phi instructions
+       * need to appear first in the block, so by definition
+       * we can't move a load_const into a block where it is
+       * consumed by a phi instruction.  We could conceivably
+       * move it into a dominator block.
+       */
+      if (instr->type == nir_instr_type_phi)
+         return NULL;
+
+      nir_block *use_block = instr->block;
+
+      if (!b) {
+         b = use_block;
+      } else if (b != use_block) {
+         return NULL;
+      }
+   }
+
+   return b;
+}
+
+bool
+nir_move_load_const(nir_shader *shader)
+{
+   bool progress = false;
+
+   nir_foreach_function(function, shader) {
+      if (!function->impl)
+         continue;
+
+
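
(The archive truncates the patch here.  Going by the commit message and
get_single_block() above, the rest of the pass walks each block and
sinks qualifying load_consts; a minimal sketch of that shape, not the
verbatim patch; nir_after_phis() is used here so the moved instruction
lands after any phis at the top of the use block:)

      nir_foreach_block(block, function->impl) {
         nir_foreach_instr_safe(instr, block) {
            if (instr->type != nir_instr_type_load_const)
               continue;

            nir_load_const_instr *load = nir_instr_as_load_const(instr);
            nir_block *use_block = get_single_block(&load->def);

            /* bail if there is no single use block, or if the
             * load_const is already in it:
             */
            if (!use_block || use_block == instr->block)
               continue;

            exec_node_remove(&instr->node);
            nir_instr_insert(nir_after_phis(use_block), instr);

            progress = true;
         }
      }

      /* instructions only moved around within an unchanged CFG: */
      nir_metadata_preserve(function->impl,
                            nir_metadata_block_index |
                            nir_metadata_dominance);
   }

   return progress;
}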