Re: [Mesa-dev] [PATCH 3/5] i965/ir: Skip eliminate_find_live_channel() for stages with sparse thread dispatch.

2016-09-16 Thread Francisco Jerez
Jason Ekstrand  writes:

> On Fri, Sep 16, 2016 at 3:03 PM, Francisco Jerez 
> wrote:
>
>> The eliminate_find_live_channel optimization eliminates
>> FIND_LIVE_CHANNEL instructions in cases where control flow is known to
>> be uniform, and replaces them with 'MOV 0', which in turn unblocks
>> subsequent elimination of the BROADCAST instruction frequently used on
>> the result of FIND_LIVE_CHANNEL.  This is however not correct in
>> per-sample fragment shader dispatch because the PSD can dispatch a
>> fully unlit sample under certain conditions.  Disable the optimization
>> in that case.
>> ---
>>  src/mesa/drivers/dri/i965/brw_compiler.h | 41
>> 
>>  src/mesa/drivers/dri/i965/brw_fs.cpp |  8 +++
>>  src/mesa/drivers/dri/i965/brw_vec4.cpp   |  8 +++
>>  3 files changed, 57 insertions(+)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_compiler.h
>> b/src/mesa/drivers/dri/i965/brw_compiler.h
>> index 84d3dde..1429875 100644
>> --- a/src/mesa/drivers/dri/i965/brw_compiler.h
>> +++ b/src/mesa/drivers/dri/i965/brw_compiler.h
>> @@ -868,6 +868,47 @@ encode_slm_size(unsigned gen, uint32_t bytes)
>> return slm_size;
>>  }
>>
>> +/**
>> + * Return true if the given shader stage is dispatched contiguously by the
>> + * relevant fixed function starting from channel 0 of the SIMD thread,
>> which
>> + * implies that the dispatch mask of a thread can be assumed to have the
>> form
>> + * '2^n - 1' for some n.
>> + */
>> +static inline bool
>> +brw_stage_has_packed_dispatch(gl_shader_stage stage,
>> +  const struct brw_stage_prog_data *prog_data)
>>
>
> Thank you, thank you, thank you for making this a well-documented helper
> function!
>

Given the amount of hardware documentation about this, any documentation
is too little documentation. ;)

>
>> +{
>> +   switch (stage) {
>> +   case MESA_SHADER_FRAGMENT: {
>> +  /* The PSD discards subspans coming in with no lit samples, which
>> in the
>> +   * per-pixel shading case implies that each subspan will either be
>> fully
>> +   * lit (due to the VMask being used to allow derivative
>> computations),
>> +   * or not dispatched at all.  In per-sample dispatch mode individual
>> +   * samples from the same subspan have a fixed relative location
>> within
>> +   * the SIMD thread, so dispatch of unlit samples cannot be avoided
>> in
>> +   * general and we should return false.
>> +   */
>> +  const struct brw_wm_prog_data *wm_prog_data =
>> + (const struct brw_wm_prog_data *)prog_data;
>> +  return !wm_prog_data->persample_dispatch;
>> +   }
>> +   case MESA_SHADER_COMPUTE:
>> +  /* Compute shaders will be spawned with either a fully enabled
>> dispatch
>> +   * mask or with whatever bottom/right execution mask was given to
>> the
>> +   * GPGPU walker command to be used along the workgroup edges -- In
>> both
>> +   * cases the dispatch mask is required to be tightly packed for our
>> +   * invocation index calculations to work.
>> +   */
>> +  return true;
>> +   default:
>> +  /* Most remaining fixed functions are limited to use a packed
>> dispatch
>> +   * mask due to the hardware representation of the dispatch mask as a
>> +   * single counter representing the number of enabled channels.
>> +   */
>> +  return true;
>> +   }
>> +}
>> +
>>  #ifdef __cplusplus
>>  } /* extern "C" */
>>  #endif
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> index bb65077..32f7ae2 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> @@ -2835,6 +2835,14 @@ fs_visitor::eliminate_find_live_channel()
>> bool progress = false;
>> unsigned depth = 0;
>>
>> +   if (!brw_stage_has_packed_dispatch(stage, stage_prog_data)) {
>> +  /* The optimization below assumes that channel zero is live on
>> thread
>> +   * dispatch, which may not be the case if the fixed function
>> dispatches
>> +   * threads sparsely.
>> +   */
>> +  return progress;
>>
>
> Maybe just return false?
>

Sure, don't have a strong preference, changed locally.

>
>> +   }
>> +
>> foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
>>switch (inst->opcode) {
>>case BRW_OPCODE_IF:
>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> index 58c8a8a..d5bb82b 100644
>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
>> @@ -1291,6 +1291,14 @@ vec4_visitor::eliminate_find_live_channel()
>> bool progress = false;
>> unsigned depth = 0;
>>
>> +   if (!brw_stage_has_packed_dispatch(stage, stage_prog_data)) {
>> +  /* The optimization below assumes that channel zero is live on
>> thread
>> +   * dispatch, which may not be the case if the fixed function
>> dispatches
>> +   * threads sparsely.
>> +   */
>> + 

Re: [Mesa-dev] [PATCH 3/5] i965/ir: Skip eliminate_find_live_channel() for stages with sparse thread dispatch.

2016-09-16 Thread Jason Ekstrand
On Fri, Sep 16, 2016 at 3:03 PM, Francisco Jerez 
wrote:

> The eliminate_find_live_channel optimization eliminates
> FIND_LIVE_CHANNEL instructions in cases where control flow is known to
> be uniform, and replaces them with 'MOV 0', which in turn unblocks
> subsequent elimination of the BROADCAST instruction frequently used on
> the result of FIND_LIVE_CHANNEL.  This is however not correct in
> per-sample fragment shader dispatch because the PSD can dispatch a
> fully unlit sample under certain conditions.  Disable the optimization
> in that case.
> ---
>  src/mesa/drivers/dri/i965/brw_compiler.h | 41
> 
>  src/mesa/drivers/dri/i965/brw_fs.cpp |  8 +++
>  src/mesa/drivers/dri/i965/brw_vec4.cpp   |  8 +++
>  3 files changed, 57 insertions(+)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_compiler.h
> b/src/mesa/drivers/dri/i965/brw_compiler.h
> index 84d3dde..1429875 100644
> --- a/src/mesa/drivers/dri/i965/brw_compiler.h
> +++ b/src/mesa/drivers/dri/i965/brw_compiler.h
> @@ -868,6 +868,47 @@ encode_slm_size(unsigned gen, uint32_t bytes)
> return slm_size;
>  }
>
> +/**
> + * Return true if the given shader stage is dispatched contiguously by the
> + * relevant fixed function starting from channel 0 of the SIMD thread,
> which
> + * implies that the dispatch mask of a thread can be assumed to have the
> form
> + * '2^n - 1' for some n.
> + */
> +static inline bool
> +brw_stage_has_packed_dispatch(gl_shader_stage stage,
> +  const struct brw_stage_prog_data *prog_data)
>

Thank you, thank you, thank you for making this a well-documented helper
function!


> +{
> +   switch (stage) {
> +   case MESA_SHADER_FRAGMENT: {
> +  /* The PSD discards subspans coming in with no lit samples, which
> in the
> +   * per-pixel shading case implies that each subspan will either be
> fully
> +   * lit (due to the VMask being used to allow derivative
> computations),
> +   * or not dispatched at all.  In per-sample dispatch mode individual
> +   * samples from the same subspan have a fixed relative location
> within
> +   * the SIMD thread, so dispatch of unlit samples cannot be avoided
> in
> +   * general and we should return false.
> +   */
> +  const struct brw_wm_prog_data *wm_prog_data =
> + (const struct brw_wm_prog_data *)prog_data;
> +  return !wm_prog_data->persample_dispatch;
> +   }
> +   case MESA_SHADER_COMPUTE:
> +  /* Compute shaders will be spawned with either a fully enabled
> dispatch
> +   * mask or with whatever bottom/right execution mask was given to
> the
> +   * GPGPU walker command to be used along the workgroup edges -- In
> both
> +   * cases the dispatch mask is required to be tightly packed for our
> +   * invocation index calculations to work.
> +   */
> +  return true;
> +   default:
> +  /* Most remaining fixed functions are limited to use a packed
> dispatch
> +   * mask due to the hardware representation of the dispatch mask as a
> +   * single counter representing the number of enabled channels.
> +   */
> +  return true;
> +   }
> +}
> +
>  #ifdef __cplusplus
>  } /* extern "C" */
>  #endif
> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
> b/src/mesa/drivers/dri/i965/brw_fs.cpp
> index bb65077..32f7ae2 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> @@ -2835,6 +2835,14 @@ fs_visitor::eliminate_find_live_channel()
> bool progress = false;
> unsigned depth = 0;
>
> +   if (!brw_stage_has_packed_dispatch(stage, stage_prog_data)) {
> +  /* The optimization below assumes that channel zero is live on
> thread
> +   * dispatch, which may not be the case if the fixed function
> dispatches
> +   * threads sparsely.
> +   */
> +  return progress;
>

Maybe just return false?


> +   }
> +
> foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
>switch (inst->opcode) {
>case BRW_OPCODE_IF:
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp
> b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> index 58c8a8a..d5bb82b 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
> @@ -1291,6 +1291,14 @@ vec4_visitor::eliminate_find_live_channel()
> bool progress = false;
> unsigned depth = 0;
>
> +   if (!brw_stage_has_packed_dispatch(stage, stage_prog_data)) {
> +  /* The optimization below assumes that channel zero is live on
> thread
> +   * dispatch, which may not be the case if the fixed function
> dispatches
> +   * threads sparsely.
> +   */
> +  return progress;
>

Same here.


> +   }
> +
> foreach_block_and_inst_safe(block, vec4_instruction, inst, cfg) {
>switch (inst->opcode) {
>case BRW_OPCODE_IF:
> --
> 2.9.0
>
>
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedeskto