Re: [Mesa-dev] [Mesa-stable] [PATCH 2/3] glsl: TCS outputs can not be transform feedback candidates on GLES

2019-03-14 Thread Chema Casanova
On 13/3/19 23:17, Emil Velikov wrote:
> Hi Jose,
> 
> On Wed, 21 Nov 2018 at 18:45, Jose Maria Casanova Crespo
>  wrote:
>>
>> Fixes: 
>> KHR-GLES*.core.tessellation_shader.single.xfb_captures_data_from_correct_stage
>>
> This and the follow-up patch "glsl: fix recording of variables for XFB
> in TCS shaders" are explicitly marked as 19.0 only.
> As such I've omitted them from 18.3, let me know if you prefer to include 
> them.

I've checked and both patches apply cleanly on 18.3, and they fix the CTS
failure there. So I think it is useful to include them.

Thanks for checking,

Chema
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH 1/3] glsl: XFB TSC per-vertex output varyings match as not declared as arrays

2019-02-22 Thread Chema Casanova
V2 of this series, which addresses the fix for the
KHR-GL*.tessellation_shader.single.xfb_captures_data_from_correct_stage
regressions, is available as GitLab MR !300:

https://gitlab.freedesktop.org/mesa/mesa/merge_requests/300
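
As a minimal sketch of the naming change this series deals with (an
illustration, not Mesa code): the updated CTS asks for a per-vertex TCS
block member as "BLOCK_INOUT.value" although the block is declared as an
array. The helper name below is hypothetical, it only compares names, and
it assumes the subscript sits on the outermost (block) name.

```cpp
#include <string_view>

// Hypothetical helper, not part of Mesa: returns true when `requested`
// equals `declared` with the subscript on the outermost name removed.
constexpr bool matches_without_outer_array(std::string_view declared,
                                           std::string_view requested)
{
    constexpr auto npos = std::string_view::npos;
    const auto open = declared.find('[');
    const auto dot = declared.find('.');

    // No subscript on the outermost name: the names must match exactly.
    if (open == npos || (dot != npos && open > dot))
        return declared == requested;

    const auto close = declared.find(']', open);
    if (close == npos)
        return false; // malformed name

    const std::string_view head = declared.substr(0, open);
    const std::string_view tail = declared.substr(close + 1);
    return requested.size() == head.size() + tail.size() &&
           requested.substr(0, head.size()) == head &&
           requested.substr(head.size()) == tail;
}

// The non-arrayed name matches; the arrayed one no longer does.
static_assert(matches_without_outer_array("BLOCK_INOUT[0].value",
                                          "BLOCK_INOUT.value"),
              "non-arrayed name should match");
static_assert(!matches_without_outer_array("BLOCK_INOUT[0].value",
                                           "BLOCK_INOUT[0].value"),
              "arrayed name should not match in this sketch");
```

In the real patch the effect comes from dropping the array type of the
candidate (without_array()), not from string comparison; the sketch only
shows which names end up matching.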

On 21/11/18 19:45, Jose Maria Casanova Crespo wrote:
> Recent change on OpenGL CTS ("Use non-arrayed varying name for TCS blocks")
> on KHR-GL*.tessellation_shader.single.xfb_captures_data_from_correct_stage
> tests changed how to name per-vertex Tessellation Control Shader output
> varyings in transform feedback using interface block as "BLOCK_INOUT.value"
> rather than "BLOCK_INOUT[0].value"
> 
> So Tessellation control shader per-vertex output variables and blocks that
> are required to be declared as arrays, with each element representing output
> values for a single vertex of a multi-vertex primitive are expected to be
> named as they were not declared as arrays.
> 
> This patch adds a new is_xfb_per_vertex_output flag at ir_variable level so
> we mark when an ir_variable is an per-vertex TCS output varying. So we
> treat it in terms on XFB its naming as a non array variable.
> 
> As we don't support NV_gpu_shader5, so PATCHES mode is not accepted as
> primitiveMode parameter of BeginTransformFeedback the test expects a
> failure as we can use the XFB results.
> 
> This patch uncovers that we were passing the GLES version of the tests
> because candidates naming didn't match, not because on GLES the Tessellation
> Control stage varyings shouldn't be XFB candidates in any case. This
> is addressed in the following patch.
> 
> Fixes: 
> KHR-GL4*.tessellation_shader.single.xfb_captures_data_from_correct_stage
> 
> Cc: mesa-sta...@lists.freedesktop.org
> ---
>  src/compiler/glsl/ir.cpp| 1 +
>  src/compiler/glsl/ir.h  | 6 ++
>  src/compiler/glsl/link_uniforms.cpp | 6 --
>  src/compiler/glsl/link_varyings.cpp | 8 +++-
>  4 files changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/src/compiler/glsl/ir.cpp b/src/compiler/glsl/ir.cpp
> index 1d1a56ae9a5..582111d71f5 100644
> --- a/src/compiler/glsl/ir.cpp
> +++ b/src/compiler/glsl/ir.cpp
> @@ -1750,6 +1750,7 @@ ir_variable::ir_variable(const struct glsl_type *type, 
> const char *name,
> this->data.fb_fetch_output = false;
> this->data.bindless = false;
> this->data.bound = false;
> +   this->data.is_xfb_per_vertex_output = false;
>  
> if (type != NULL) {
>if (type->is_interface())
> diff --git a/src/compiler/glsl/ir.h b/src/compiler/glsl/ir.h
> index f478b29a6b5..e09f053b77c 100644
> --- a/src/compiler/glsl/ir.h
> +++ b/src/compiler/glsl/ir.h
> @@ -766,6 +766,12 @@ public:
> */
>unsigned is_xfb_only:1;
>  
> +  /**
> +   * Is this varying a TSC per-vertex output candidate for transform
> +   * feedback?
> +   */
> +  unsigned is_xfb_per_vertex_output:1;
> +
>/**
> * Was a transfor feedback buffer set in the shader?
> */
> diff --git a/src/compiler/glsl/link_uniforms.cpp 
> b/src/compiler/glsl/link_uniforms.cpp
> index 63e688b19a7..547da68e216 100644
> --- a/src/compiler/glsl/link_uniforms.cpp
> +++ b/src/compiler/glsl/link_uniforms.cpp
> @@ -72,8 +72,10 @@ program_resource_visitor::process(ir_variable *var, bool 
> use_std430_as_default)
>   get_internal_ifc_packing(use_std430_as_default) :
>var->type->get_internal_ifc_packing(use_std430_as_default);
>  
> -   const glsl_type *t =
> -  var->data.from_named_ifc_block ? var->get_interface_type() : var->type;
> +   const glsl_type *t = var->data.from_named_ifc_block ?
> +  (var->data.is_xfb_per_vertex_output ?
> +   var->get_interface_type()->without_array() :
> +   var->get_interface_type()) : var->type;
> const glsl_type *t_without_array = t->without_array();
>  
> /* false is always passed for the row_major parameter to the other
> diff --git a/src/compiler/glsl/link_varyings.cpp 
> b/src/compiler/glsl/link_varyings.cpp
> index 52e493cb599..1964dcc0a22 100644
> --- a/src/compiler/glsl/link_varyings.cpp
> +++ b/src/compiler/glsl/link_varyings.cpp
> @@ -2150,7 +2150,10 @@ private:
>tfeedback_candidate *candidate
>   = rzalloc(this->mem_ctx, tfeedback_candidate);
>candidate->toplevel_var = this->toplevel_var;
> -  candidate->type = type;
> +  if (this->toplevel_var->data.is_xfb_per_vertex_output)
> + candidate->type = type->without_array();
> +  else
> + candidate->type = type;
>candidate->offset = this->varying_floats;
>_mesa_hash_table_insert(this->tfeedback_candidates,
>ralloc_strdup(this->mem_ctx, name),
> @@ -2499,6 +2502,9 @@ assign_varying_locations(struct gl_context *ctx,
>  
>   if (num_tfeedback_decls > 0) {
>  tfeedback_candidate_generator g(mem_ctx, tfeedback_candidates);
> +if (producer->Stage == MESA_SHADER_TESS_CTRL &&
> +!output_var->data.patch)
> +   

Re: [Mesa-dev] [PATCH] intel/compiler: Add a file-level description of brw_eu_validate.c

2019-01-25 Thread Chema Casanova
Acked-by: Jose Maria Casanova Crespo 

Matt, I'll add to my TODO list creating tests for the restriction on
sends using grf127 as destination and for the one about byte raw moves,
execution size and stride, once I receive your feedback.

Chema

On 24/1/19 at 20:53, Matt Turner wrote:
> ---
>  src/intel/compiler/brw_eu_validate.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/src/intel/compiler/brw_eu_validate.c 
> b/src/intel/compiler/brw_eu_validate.c
> index a25010b225c..7f1580a5bb3 100644
> --- a/src/intel/compiler/brw_eu_validate.c
> +++ b/src/intel/compiler/brw_eu_validate.c
> @@ -1,5 +1,5 @@
>  /*
> - * Copyright © 2015 Intel Corporation
> + * Copyright © 2015-2019 Intel Corporation
>   *
>   * Permission is hereby granted, free of charge, to any person obtaining a
>   * copy of this software and associated documentation files (the "Software"),
> @@ -24,6 +24,18 @@
>  /** @file brw_eu_validate.c
>   *
>   * This file implements a pass that validates shader assembly.
> + *
> + * The restrictions implemented herein are intended to verify that 
> instructions
> + * in shader assembly do not violate restrictions documented in the graphics
> + * programming reference manuals.
> + *
> + * The restrictions are difficult for humans to quickly verify due to their
> + * complexity and abundance.
> + *
> + * It is critical that this code is thoroughly unit tested because false
> + * results it will lead developers astray, which is worse than having no
> + * validator at all. Patches to this file without corresponding unit tests 
> (in
> + * test_eu_validate.cpp) will be rejected.
>   */
>  
>  #include "brw_eu.h"
> 


Re: [Mesa-dev] [PATCH 5/9] intel/compiler: relax brw_eu_validate for byte raw movs

2019-01-23 Thread Chema Casanova
On 23/1/19 at 7:26, Matt Turner wrote:
> On Sun, Jul 8, 2018 at 5:27 PM, Jose Maria Casanova Crespo
>  wrote:
>> When the destination is a BYTE type allow raw movs
>> even if the stride is not exact multiple of destination
>> type and exec type, execution type is Word and its size is 2.
>>
>> This restriction was only allowing stride==2 destinations
>> for 8-bit types.
> 
> Super late review, obviously... it's been on my todo list but fp64 was
> taking all my time.
> 
> I can't figure this commit out. What I know:
> 
>  - byte destination
>  - raw mov (which means destination stride == 1)

I think that stride == 1 is a requirement for a raw mov, based on how it
is written in the PRM KBL Vol2A, "mov - Move" (Page 1081, PDF Page 1099).

"A mov with the same source and destination type, no source modifier,
and no saturation is a raw move."

But just after this text there is a reference to a stride restriction when
the destination stride is 1 for byte types:

"A packed byte destination region (B or UB type with HorzStride == 1 and
ExecSize > 1) can only be written using raw move."

>  - execution type of a byte operation is "word"
> 
> The original code
> 
>> if (exec_type_size > dst_type_size) {
>>ERROR_IF(dst_stride * dst_type_size != exec_type_size,
>> "Destination stride must be equal to the ratio of the sizes 
>> of "
>> "the execution data type to the destination type");
>> }
> 
> would not have worked for an instruction like
> 
>> mov(8)  g0<1>B  g0<8,8,1>B
> 
> But, that's okay because it didn't need to since the block right above
> it does this:
> 
>if (dst_type_is_byte) {
>   if (is_packed(exec_size * dst_stride, exec_size, dst_stride)) {
>  if (!inst_is_raw_move(devinfo, inst)) {
> ERROR("Only raw MOV supports a packed-byte destination");
> return error_msg;
>  } else {
> return (struct string){};
>  }
>   }
>}
> 
> That is, if it's a raw move, return no-error.

Packed raw MOVs.

> It would be easier to understand what you were fixing if you had added
> a unit test to test_eu_validate.cpp, or (if my suspicions are correct)
> it would have proven to you that this patch wasn't correct.

Agree. And I should have also included in the commit log an example of
the kind of operations that we would like to allow.

> Was this just something that you noticed by inspection?

This issue was found because the validator was raising errors on the
shuffle/unshuffle operations when multiple 8-bit components were
prepared to be written as one 32-bit component, for example for
store_ssbo. So we had operations like these:

mov(8) g9<4>B    g3<8,8,1>B     { align1 1H };
mov(8) g9.1<4>B  g3.8<8,8,1>B   { align1 1H };
mov(8) g9.2<4>B  g3.16<8,8,1>B  { align1 1H };
mov(8) g9.3<4>B  g3.24<8,8,1>B  { align1 1H };

In these cases we do not have packed raw movs, and the instructions worked
perfectly fine when they were tested on the HW. But reading the PRM you would
have doubts about whether they are really allowed, because it states
"Destination stride must be equal to the ratio of the sizes of the
execution data type to the destination type", and the execution type size
of 2 for bytes makes it more complex to understand.

I suppose that the thing is that in this case the PRM seems to be
under-documented, and we can consider that this is implicitly allowed.
So we need to relax the validation.
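
To make the arithmetic concrete, here is a small sketch of the check as
quoted above and of one possible relaxation. The function names are
hypothetical and the relaxed rule is only my reading of the commit message
(byte destination plus raw mov), not the exact code of the patch. Sizes
are in bytes.

```cpp
// Hypothetical helpers for illustration only, not brw_eu_validate.c code.

// The quoted check: when the execution type is wider than the destination
// type, the destination stride must equal the ratio of the two sizes.
constexpr bool strict_dst_stride_ok(unsigned dst_stride,
                                    unsigned dst_type_size,
                                    unsigned exec_type_size)
{
    if (exec_type_size > dst_type_size)
        return dst_stride * dst_type_size == exec_type_size;
    return true;
}

// Assumed relaxation: a raw mov with a byte destination skips the check.
constexpr bool relaxed_dst_stride_ok(unsigned dst_stride,
                                     unsigned dst_type_size,
                                     unsigned exec_type_size,
                                     bool is_raw_mov)
{
    if (is_raw_mov && dst_type_size == 1)
        return true;
    return strict_dst_stride_ok(dst_stride, dst_type_size, exec_type_size);
}

// mov(8) g9<4>B g3<8,8,1>B: stride 4, byte destination, word execution type.
static_assert(!strict_dst_stride_ok(4, 1, 2),
              "4 * 1 != 2, so the strict rule rejects the shuffle mov");
static_assert(relaxed_dst_stride_ok(4, 1, 2, true),
              "the relaxed rule accepts it as a byte raw mov");
// A <2>B destination already satisfied the strict rule.
static_assert(strict_dst_stride_ok(2, 1, 2), "2 * 1 == 2");
```

A non-raw instruction with the same <4>B destination is still rejected in
this sketch, which is the behavior I would expect the validator to keep.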

>> Reviewed-by: Jason Ekstrand 
> 
> Sigh.


Re: [Mesa-dev] [PATCH] intel/fs: Do the grf127 hack on SIMD8 instructions in SIMD16 mode

2019-01-16 Thread Chema Casanova
If Matt's concerns about the validation rule are resolved, this is:

Reviewed-by: Jose Maria Casanova Crespo 

On 15/1/19 at 17:58, Jason Ekstrand wrote:
> Previously, we only applied the fix to shaders with a dispatch mode of
> SIMD8 but the code it relies on for SIMD16 mode only applies to SIMD16
> instructions.  If you have a SIMD8 instruction in a SIMD16 shader,
> neither would trigger and the restriction could still be hit.
> 
> Cc: Jose Maria Casanova Crespo 
> Fixes: 232ed8980217dd "i965/fs: Register allocator shoudn't use grf127..."
> ---
>  src/intel/compiler/brw_fs_reg_allocate.cpp | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs_reg_allocate.cpp 
> b/src/intel/compiler/brw_fs_reg_allocate.cpp
> index 5db5242452e..ec743f9b5bf 100644
> --- a/src/intel/compiler/brw_fs_reg_allocate.cpp
> +++ b/src/intel/compiler/brw_fs_reg_allocate.cpp
> @@ -667,15 +667,14 @@ fs_visitor::assign_regs(bool allow_spilling, bool 
> spill_all)
> * messages adding a node interference to the grf127_send_hack_node.
> * This node has a fixed asignment to grf127.
> *
> -   * We don't apply it to SIMD16 because previous code avoids any 
> register
> -   * overlap between sources and destination.
> +   * We don't apply it to SIMD16 instructions because previous code 
> avoids
> +   * any register overlap between sources and destination.
> */
>ra_set_node_reg(g, grf127_send_hack_node, 127);
> -  if (dispatch_width == 8) {
> - foreach_block_and_inst(block, fs_inst, inst, cfg) {
> -if (inst->is_send_from_grf() && inst->dst.file == VGRF)
> -   ra_add_node_interference(g, inst->dst.nr, 
> grf127_send_hack_node);
> - }
> +  foreach_block_and_inst(block, fs_inst, inst, cfg) {
> + if (inst->exec_size < 16 && inst->is_send_from_grf() &&
> + inst->dst.file == VGRF)
> +ra_add_node_interference(g, inst->dst.nr, grf127_send_hack_node);
>}
>  
>if (spilled_any_registers) {
> 
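
The condition change in the patch above can be sketched as two predicates.
This is a simplified illustration: the struct and function names are
hypothetical, and the two booleans stand in for inst->is_send_from_grf()
and inst->dst.file == VGRF.

```cpp
// Hypothetical stand-in for the fields of fs_inst that the check reads.
struct InstSketch {
    unsigned exec_size;      // execution size of this instruction
    bool is_send_from_grf;   // stands in for inst->is_send_from_grf()
    bool dst_is_vgrf;        // stands in for inst->dst.file == VGRF
};

// Before the patch: the hack depended on the shader dispatch width.
constexpr bool old_adds_interference(unsigned dispatch_width,
                                     const InstSketch &inst)
{
    return dispatch_width == 8 && inst.is_send_from_grf && inst.dst_is_vgrf;
}

// After the patch: the hack depends on the instruction execution size.
constexpr bool new_adds_interference(const InstSketch &inst)
{
    return inst.exec_size < 16 && inst.is_send_from_grf && inst.dst_is_vgrf;
}

// A SIMD8 send inside a SIMD16 shader was missed before and is covered now.
static_assert(!old_adds_interference(16, InstSketch{8, true, true}),
              "old condition skips SIMD8 sends in SIMD16 shaders");
static_assert(new_adds_interference(InstSketch{8, true, true}),
              "new condition covers them");
// SIMD16 sends are still left to the earlier overlap-avoidance code.
static_assert(!new_adds_interference(InstSketch{16, true, true}),
              "SIMD16 sends are not given the grf127 interference");
```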


Re: [Mesa-dev] [PATCH] intel/fs: Do the grf127 hack on SIMD8 instructions in SIMD16 mode

2019-01-16 Thread Chema Casanova
On 16/1/19 at 0:55, Matt Turner wrote:
> On Tue, Jan 15, 2019 at 8:58 AM Jason Ekstrand  wrote:
>>
>> Previously, we only applied the fix to shaders with a dispatch mode of
>> SIMD8 but the code it relies on for SIMD16 mode only applies to SIMD16
>> instructions.  If you have a SIMD8 instruction in a SIMD16 shader,
>> neither would trigger and the restriction could still be hit.
>>
>> Cc: Jose Maria Casanova Crespo 
>> Fixes: 232ed8980217dd "i965/fs: Register allocator shoudn't use grf127..."
>> ---
>>  src/intel/compiler/brw_fs_reg_allocate.cpp | 13 ++---
>>  1 file changed, 6 insertions(+), 7 deletions(-)
>>
>> diff --git a/src/intel/compiler/brw_fs_reg_allocate.cpp 
>> b/src/intel/compiler/brw_fs_reg_allocate.cpp
>> index 5db5242452e..ec743f9b5bf 100644
>> --- a/src/intel/compiler/brw_fs_reg_allocate.cpp
>> +++ b/src/intel/compiler/brw_fs_reg_allocate.cpp
>> @@ -667,15 +667,14 @@ fs_visitor::assign_regs(bool allow_spilling, bool 
>> spill_all)
>> * messages adding a node interference to the grf127_send_hack_node.
>> * This node has a fixed asignment to grf127.
>> *
>> -   * We don't apply it to SIMD16 because previous code avoids any 
>> register
>> -   * overlap between sources and destination.
>> +   * We don't apply it to SIMD16 instructions because previous code 
>> avoids
>> +   * any register overlap between sources and destination.
>> */
>>ra_set_node_reg(g, grf127_send_hack_node, 127);
>> -  if (dispatch_width == 8) {
>> - foreach_block_and_inst(block, fs_inst, inst, cfg) {
>> -if (inst->is_send_from_grf() && inst->dst.file == VGRF)
>> -   ra_add_node_interference(g, inst->dst.nr, 
>> grf127_send_hack_node);
>> - }
>> +  foreach_block_and_inst(block, fs_inst, inst, cfg) {
>> + if (inst->exec_size < 16 && inst->is_send_from_grf() &&
>> + inst->dst.file == VGRF)
>> +ra_add_node_interference(g, inst->dst.nr, 
>> grf127_send_hack_node);
>>}
>>
> 
> Did the code in brw_eu_validate.c catch the case you found?
> 
> In fact, that code looks wrong:
> 
> |  (brw_inst_dst_da_reg_nr(devinfo, inst) +
> |   brw_inst_rlen(devinfo, inst) > 127) &&
> 
> I think > should be >=. And maybe we should have a separate case
> earlier that checks that dst_nr+rlen actually fits in registers, and
> then change > to just ==. FFS :(

This restriction only applies when we have a return register for the
SEND, so rlen is >= 1. If we used ">= 127", we would be raising an
error for a legal destination register of a send in cases like
grf126 with rlen = 1.

We could agree that another way of expressing it would be using
dst_nr + rlen == 128. But in any case something over 128 would mean that
something went wrong too. :)

I agree that it also makes sense to include a general check for dst_nr +
rlen <= 128, and the same for sources. Although it isn't possible for our
register allocator to assign such a non-existent combination, the check
would make changes to the register classes safer.
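
A quick sketch of the three comparisons discussed here, covering only the
destination-range part of the validator condition quoted above. The
function names are hypothetical, and 128 is assumed as the number of GRFs
(grf0 to grf127).

```cpp
// Hypothetical helpers; dst_nr is the first destination register and
// rlen the number of registers written back by the SEND (rlen >= 1).
constexpr unsigned kNumGrfs = 128; // assumption: grf0..grf127

// Current comparison: true when the last written register is grf127
// (or beyond, which would already be invalid).
constexpr bool dst_reaches_grf127(unsigned dst_nr, unsigned rlen)
{
    return dst_nr + rlen > 127;
}

// The suggested ">=" variant, shown only to expose the false positive.
constexpr bool dst_reaches_grf127_ge(unsigned dst_nr, unsigned rlen)
{
    return dst_nr + rlen >= 127;
}

// General range check: the destination must fit in the register file.
constexpr bool dst_fits_in_grfs(unsigned dst_nr, unsigned rlen)
{
    return dst_nr + rlen <= kNumGrfs;
}

// grf126 with rlen = 1 writes only grf126, so it is legal.
static_assert(!dst_reaches_grf127(126, 1), "grf126 alone is fine");
static_assert(dst_reaches_grf127_ge(126, 1), "'>=' would flag it wrongly");
// grf127 with rlen = 1 and grf126 with rlen = 2 both end at grf127.
static_assert(dst_reaches_grf127(127, 1), "ends at grf127");
static_assert(dst_reaches_grf127(126, 2), "ends at grf127");
// grf127 with rlen = 2 does not exist at all.
static_assert(!dst_fits_in_grfs(127, 2), "would run past grf127");
```

With the range check done first, the grf127 test could be written as
dst_nr + rlen == 128 without changing which instructions are flagged.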

> Not sure what I was thinking letting that patch through without a unit
> test. I'll do that.

I should have taken a look at test_eu_validate.cpp

Regards,

Chema


Re: [Mesa-dev] [PATCH 1/3] glsl: XFB TSC per-vertex output varyings match as not declared as arrays

2018-12-13 Thread Chema Casanova
Ping.

On 22/11/18 at 0:28, Chema Casanova wrote:
> 
> 
> On 21/11/18 20:04, Ilia Mirkin wrote:
>> On Wed, Nov 21, 2018 at 1:45 PM Jose Maria Casanova Crespo
>>  wrote:
>>>
>>> Recent change on OpenGL CTS ("Use non-arrayed varying name for TCS blocks")
>>> on KHR-GL*.tessellation_shader.single.xfb_captures_data_from_correct_stage
>>> tests changed how to name per-vertex Tessellation Control Shader output
>>> varyings in transform feedback using interface block as "BLOCK_INOUT.value"
>>> rather than "BLOCK_INOUT[0].value"
>>>
>>> So Tessellation control shader per-vertex output variables and blocks that
>>> are required to be declared as arrays, with each element representing output
>>> values for a single vertex of a multi-vertex primitive are expected to be
>>> named as they were not declared as arrays.
>>>
>>> This patch adds a new is_xfb_per_vertex_output flag at ir_variable level so
>>> we mark when an ir_variable is an per-vertex TCS output varying. So we
>>> treat it in terms on XFB its naming as a non array variable.
>>>
>>> As we don't support NV_gpu_shader5, so PATCHES mode is not accepted as
>>> primitiveMode parameter of BeginTransformFeedback the test expects a
>>> failure as we can use the XFB results.
>>>
>>> This patch uncovers that we were passing the GLES version of the tests
>>> because candidates naming didn't match, not because on GLES the Tessellation
>>> Control stage varyings shouldn't be XFB candidates in any case. This
>>> is addressed in the following patch.
>>>
>>> Fixes: 
>>> KHR-GL4*.tessellation_shader.single.xfb_captures_data_from_correct_stage
>>>
>>> Cc: mesa-sta...@lists.freedesktop.org
>>> ---
>>>  src/compiler/glsl/ir.cpp| 1 +
>>>  src/compiler/glsl/ir.h  | 6 ++
>>>  src/compiler/glsl/link_uniforms.cpp | 6 --
>>>  src/compiler/glsl/link_varyings.cpp | 8 +++-
>>>  4 files changed, 18 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/src/compiler/glsl/ir.cpp b/src/compiler/glsl/ir.cpp
>>> index 1d1a56ae9a5..582111d71f5 100644
>>> --- a/src/compiler/glsl/ir.cpp
>>> +++ b/src/compiler/glsl/ir.cpp
>>> @@ -1750,6 +1750,7 @@ ir_variable::ir_variable(const struct glsl_type 
>>> *type, const char *name,
>>> this->data.fb_fetch_output = false;
>>> this->data.bindless = false;
>>> this->data.bound = false;
>>> +   this->data.is_xfb_per_vertex_output = false;
>>>
>>> if (type != NULL) {
>>>if (type->is_interface())
>>> diff --git a/src/compiler/glsl/ir.h b/src/compiler/glsl/ir.h
>>> index f478b29a6b5..e09f053b77c 100644
>>> --- a/src/compiler/glsl/ir.h
>>> +++ b/src/compiler/glsl/ir.h
>>> @@ -766,6 +766,12 @@ public:
>>> */
>>>unsigned is_xfb_only:1;
>>>
>>> +  /**
>>> +   * Is this varying a TSC per-vertex output candidate for transform
>>
>> TCS?
> 
> 
> Yes. I've fixed it locally at the commit summary too.
> 
> 
>>> +   * feedback?
>>> +   */
>>> +  unsigned is_xfb_per_vertex_output:1;
>>> +
>>>/**
>>> * Was a transfor feedback buffer set in the shader?
>>
>> ugh, not your problem, but "transform" :(
>>
>>> */
>>> diff --git a/src/compiler/glsl/link_uniforms.cpp 
>>> b/src/compiler/glsl/link_uniforms.cpp
>>> index 63e688b19a7..547da68e216 100644
>>> --- a/src/compiler/glsl/link_uniforms.cpp
>>> +++ b/src/compiler/glsl/link_uniforms.cpp
>>> @@ -72,8 +72,10 @@ program_resource_visitor::process(ir_variable *var, bool 
>>> use_std430_as_default)
>>>   get_internal_ifc_packing(use_std430_as_default) :
>>>var->type->get_internal_ifc_packing(use_std430_as_default);
>>>
>>> -   const glsl_type *t =
>>> -  var->data.from_named_ifc_block ? var->get_interface_type() : 
>>> var->type;
>>> +   const glsl_type *t = var->data.from_named_ifc_block ?
>>> +  (var->data.is_xfb_per_vertex_output ?
>>> +   var->get_interface_type()->without_array() :
>>> +   var->get_interface_type()) : var->type;
>>> const glsl_type *t_without_array = t->without_array();
>>>
>>> /* false is always passed for th

Re: [Mesa-dev] [PATCH 41/59] intel/compiler: split is_partial_write() into two variants

2018-12-13 Thread Chema Casanova


On 13/12/18 at 11:49, Pohjolainen, Topi wrote:
> On Thu, Dec 13, 2018 at 09:10:24AM +0100, Iago Toral wrote:
>> On Wed, 2018-12-12 at 14:15 +0200, Pohjolainen, Topi wrote:
>>> On Wed, Dec 12, 2018 at 09:48:20AM +0100, Iago Toral wrote:
 On Tue, 2018-12-11 at 18:59 +0200, Pohjolainen, Topi wrote:
> On Fri, Dec 07, 2018 at 03:30:11PM +0200, Pohjolainen, Topi
> wrote:
>> On Tue, Dec 04, 2018 at 08:17:05AM +0100, Iago Toral Quiroga
>> wrote:
>>> This function is used in two different scenarios that for 32-
>>> bit
>>> instructions are the same, but for 16-bit instructions are
>>> not.
>>>
>>> One scenario is that in which we are working at a SIMD8
>>> register
>>> level and we need to know if a register is fully defined or
>>> written.
>>> This is useful, for example, in the context of liveness
>>> analysis
>>> or
>>> register allocation, where we work with units of registers.
>>>
>>> The other scenario is that in which we want to know if an
>>> instruction
>>> is writing a full scalar component or just some subset of it.
>>> This is
>>> useful, for example, in the context of some optimization
>>> passes
>>> like copy propagation.
>>>
>>> For 32-bit instructions (or larger), a SIMD8 dispatch will
>>> always
>>> write
>>> at least a full SIMD8 register (32B) if the write is not
>>> partial.
>>> The
>>> function is_partial_write() checks this to determine if we
>>> have a
>>> partial
>>> write. However, when we deal with 16-bit instructions, that
>>> logic
>>> disables
>>> some optimizations that should be safe. For example, a SIMD8
>>> 16-
>>> bit MOV will
>>> only update half of a SIMD register, but it is still a
>>> complete
>>> write of the
>>> variable for a SIMD8 dispatch, so we should not prevent copy
>>> propagation in
>>> this scenario because we don't write all 32 bytes in the SIMD
>>> register
>>> or because the write starts at offset 16B (wehere we pack
>>> components Y or
>>> W of 16-bit vectors).
>>>
>>> This is a problem for SIMD8 executions (VS, TCS, TES, GS) of
>>> 16-
>>> bit
>>> instructions, which lose a number of optimizations because of
>>> this, most
>>> important of which is copy-propagation.
>>>
>>> This patch splits is_partial_write() into
>>> is_partial_reg_write(),
>>> which
>>> represents the current is_partial_write(), useful for things
>>> like
>>> liveness analysis, and is_partial_var_write(), which
>>> considers
>>> the dispatch size to check if we are writing a full variable
>>> (rather
>>> than a full register) to decide if the write is partial or
>>> not,
>>> which
>>> is what we really want in many optimization passes.
>
> I actually started wondering why would liveness analysis and
> register
> coalescing need to treat the 16-bit SIMD8 case differently than
> optimizations.
> In virtual register space nothing would read or write the unused
> second half
> of the register in case of 16-bit type and SIMD8.

 True, we might be able to use the "variable" version in more cases.
 I
 was trying to be conservative when I implemented this because I
 don't
 think the half-float CTS tests provides a good testing ground for
 all
 aspects of the compiler. I can try that and see if it breaks
 anything
 though.
  
> Real register allocation in turn should be orthogonal to how
> things
> are
> allocated in virtual space. And I guess even there we wouldn't be
> interested
> of packing two 16-bit typed SIMD8 variables into one and same
> hardware
> register. It is SIMD16 where we get more pressure into register
> space
> anyway
> and there the 16-bit typed variables occupy full registers. In
> other
> words,
> if things fit in SIMD16, would we bother packing things more
> tightly
> in
> SIMD8? Or even if SIMD8 was the only option, would we be
> interested
> packing
> channels for two variables in one hw reg even then?
>
> Jason, we discussed this a little in the spring time.
>
> As a recap my approach shortly. Instead of ignoring the second
> half
> of
> registers case by case I addressed it more generally:
>
> - changed all the open coded checks to use helpers,
> - added a padding bit into fs_reg telling about the unused space,
> - change nir -> fs step to set that bit for 16-bit typed regs
> - and finally changed the helpers to consider the padding bit.

 So if I understand how this works, you mostly make the vgrf
 infrastructure think that half-float registers actually use twice
 the
 space they require by including the padding into the
 component_size()
 helper, 

Re: [Mesa-dev] [PATCH 3/3] glsl: fix typos in comments "transfor" -> "transform"

2018-11-21 Thread Chema Casanova
On 21/11/18 20:07, Ilia Mirkin wrote:
> Oh, yay, you fixed the typos here. I just had to keep reading.
> 
> This patch is obviously
> 
> Reviewed-by: Ilia Mirkin 

Thanks.

> 
> For the others ... have you run these through intel's CI?

> I'm interested in verifying that dEQP, CTS, and piglit all remain happy
> with the changes.

Yes. The CI is happy for dEQP, OpenGL CTS and piglit.

https://mesa-ci.01.org/jmcasanova/builds/13/group/63a9f0ea7bb98050796b649e85481845

It detects a regression unrelated to my changes in vulkancts on BDW,
dEQP-VK.subgroups.shuffle.subgroupshuffleup_bvec4_graphic, but on my
BDW it passes. (I've just resent the series to confirm it.)

> The program resource stuff took a while to nail down
> properly (and it seems like we're discovering issues to this very
> day).

Yes, it took me a while to reduce the change as much as possible, to
avoid changing current behavior except for these test cases. They are
corner cases that are not really useful in a real program, as the data
cannot be used because we don't support the NV_gpu_shader5 extension.

Thanks for checking the series.

>   -ilia
> On Wed, Nov 21, 2018 at 1:46 PM Jose Maria Casanova Crespo
>  wrote:
>>
>> ---
>>  src/compiler/glsl/ir.h | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/src/compiler/glsl/ir.h b/src/compiler/glsl/ir.h
>> index e09f053b77c..c3f5f1f7b05 100644
>> --- a/src/compiler/glsl/ir.h
>> +++ b/src/compiler/glsl/ir.h
>> @@ -773,17 +773,17 @@ public:
>>unsigned is_xfb_per_vertex_output:1;
>>
>>/**
>> -   * Was a transfor feedback buffer set in the shader?
>> +   * Was a transform feedback buffer set in the shader?
>> */
>>unsigned explicit_xfb_buffer:1;
>>
>>/**
>> -   * Was a transfor feedback offset set in the shader?
>> +   * Was a transform feedback offset set in the shader?
>> */
>>unsigned explicit_xfb_offset:1;
>>
>>/**
>> -   * Was a transfor feedback stride set in the shader?
>> +   * Was a transform feedback stride set in the shader?
>> */
>>unsigned explicit_xfb_stride:1;
>>
>> --
>> 2.19.1
>>
>> ___
>> mesa-dev mailing list
>> mesa-dev@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> 


Re: [Mesa-dev] [PATCH 1/3] glsl: XFB TSC per-vertex output varyings match as not declared as arrays

2018-11-21 Thread Chema Casanova


On 21/11/18 20:04, Ilia Mirkin wrote:
> On Wed, Nov 21, 2018 at 1:45 PM Jose Maria Casanova Crespo
>  wrote:
>>
>> Recent change on OpenGL CTS ("Use non-arrayed varying name for TCS blocks")
>> on KHR-GL*.tessellation_shader.single.xfb_captures_data_from_correct_stage
>> tests changed how to name per-vertex Tessellation Control Shader output
>> varyings in transform feedback using interface block as "BLOCK_INOUT.value"
>> rather than "BLOCK_INOUT[0].value"
>>
>> So Tessellation control shader per-vertex output variables and blocks that
>> are required to be declared as arrays, with each element representing output
>> values for a single vertex of a multi-vertex primitive are expected to be
>> named as they were not declared as arrays.
>>
>> This patch adds a new is_xfb_per_vertex_output flag at ir_variable level so
>> we mark when an ir_variable is an per-vertex TCS output varying. So we
>> treat it in terms on XFB its naming as a non array variable.
>>
>> As we don't support NV_gpu_shader5, so PATCHES mode is not accepted as
>> primitiveMode parameter of BeginTransformFeedback the test expects a
>> failure as we can use the XFB results.
>>
>> This patch uncovers that we were passing the GLES version of the tests
>> because candidates naming didn't match, not because on GLES the Tessellation
>> Control stage varyings shouldn't be XFB candidates in any case. This
>> is addressed in the following patch.
>>
>> Fixes: 
>> KHR-GL4*.tessellation_shader.single.xfb_captures_data_from_correct_stage
>>
>> Cc: mesa-sta...@lists.freedesktop.org
>> ---
>>  src/compiler/glsl/ir.cpp| 1 +
>>  src/compiler/glsl/ir.h  | 6 ++
>>  src/compiler/glsl/link_uniforms.cpp | 6 --
>>  src/compiler/glsl/link_varyings.cpp | 8 +++-
>>  4 files changed, 18 insertions(+), 3 deletions(-)
>>
>> diff --git a/src/compiler/glsl/ir.cpp b/src/compiler/glsl/ir.cpp
>> index 1d1a56ae9a5..582111d71f5 100644
>> --- a/src/compiler/glsl/ir.cpp
>> +++ b/src/compiler/glsl/ir.cpp
>> @@ -1750,6 +1750,7 @@ ir_variable::ir_variable(const struct glsl_type *type, 
>> const char *name,
>> this->data.fb_fetch_output = false;
>> this->data.bindless = false;
>> this->data.bound = false;
>> +   this->data.is_xfb_per_vertex_output = false;
>>
>> if (type != NULL) {
>>if (type->is_interface())
>> diff --git a/src/compiler/glsl/ir.h b/src/compiler/glsl/ir.h
>> index f478b29a6b5..e09f053b77c 100644
>> --- a/src/compiler/glsl/ir.h
>> +++ b/src/compiler/glsl/ir.h
>> @@ -766,6 +766,12 @@ public:
>> */
>>unsigned is_xfb_only:1;
>>
>> +  /**
>> +   * Is this varying a TSC per-vertex output candidate for transform
> 
> TCS?


Yes. I've fixed it locally at the commit summary too.


>> +   * feedback?
>> +   */
>> +  unsigned is_xfb_per_vertex_output:1;
>> +
>>/**
>> * Was a transfor feedback buffer set in the shader?
> 
> ugh, not your problem, but "transform" :(
> 
>> */
>> diff --git a/src/compiler/glsl/link_uniforms.cpp 
>> b/src/compiler/glsl/link_uniforms.cpp
>> index 63e688b19a7..547da68e216 100644
>> --- a/src/compiler/glsl/link_uniforms.cpp
>> +++ b/src/compiler/glsl/link_uniforms.cpp
>> @@ -72,8 +72,10 @@ program_resource_visitor::process(ir_variable *var, bool 
>> use_std430_as_default)
>>   get_internal_ifc_packing(use_std430_as_default) :
>>var->type->get_internal_ifc_packing(use_std430_as_default);
>>
>> -   const glsl_type *t =
>> -  var->data.from_named_ifc_block ? var->get_interface_type() : 
>> var->type;
>> +   const glsl_type *t = var->data.from_named_ifc_block ?
>> +  (var->data.is_xfb_per_vertex_output ?
>> +   var->get_interface_type()->without_array() :
>> +   var->get_interface_type()) : var->type;
>> const glsl_type *t_without_array = t->without_array();
>>
>> /* false is always passed for the row_major parameter to the other
>> diff --git a/src/compiler/glsl/link_varyings.cpp 
>> b/src/compiler/glsl/link_varyings.cpp
>> index 52e493cb599..1964dcc0a22 100644
>> --- a/src/compiler/glsl/link_varyings.cpp
>> +++ b/src/compiler/glsl/link_varyings.cpp
>> @@ -2150,7 +2150,10 @@ private:
>>tfeedback_candidate *candidate
>>   = rzalloc(this->mem_ctx, tfeedback_candidate);
>>candidate->toplevel_var = this->toplevel_var;
>> -  candidate->type = type;
>> +  if (this->toplevel_var->data.is_xfb_per_vertex_output)
>> + candidate->type = type->without_array();
>> +  else
>> + candidate->type = type;
>>candidate->offset = this->varying_floats;
>>_mesa_hash_table_insert(this->tfeedback_candidates,
>>ralloc_strdup(this->mem_ctx, name),
>> @@ -2499,6 +2502,9 @@ assign_varying_locations(struct gl_context *ctx,
>>
>>   if (num_tfeedback_decls > 0) {
>>  tfeedback_candidate_generator g(mem_ctx, tfeedback_candidates);
>> +if (producer->Stage == MESA_SHADER_TESS_CTRL &&
>> +

Re: [Mesa-dev] [PATCH 1/2] intel/fs: New method for register_byte_use_pattern for fs_inst

2018-08-20 Thread Chema Casanova
On 29/07/18 at 19:47, Chema Casanova wrote:
> On 28/07/18 at 01:45, Francisco Jerez wrote:
>> Chema Casanova  writes:

[...]

>>>>> If we have a partial write/read:
>>>>>
>>>>> I understood that my initial pattern proposal would only be OK for
>>>>> the first GRF of src[i]/dst (reg_offset == 0)
>>>>>
>>>>> periodic_mask(this->exec_size,   /* count */
>>>>>this->src[i].stride * type_sz(this->src[i].type), /*step */
>>>>>type_sz(this->src[i].type),   /* bits */
>>>>>this->src[i].offset % REG_SIZE);  /* offset */
>>>>>
>>>>> Even if we handle only reg_offset == 0 we get a huge improvement,
>>>>> removing many of the register pressure problems we now have on all
>>>>> SIMD8 shaders in the 8/16-bit test cases.
>>>>>
>>>>> I understood your objection to be that, when the source/destination
>>>>> uses more than 1 GRF (reg_offset == 1), we cannot guarantee that the
>>>>> same internal offset (this->src[i].offset % REG_SIZE) as the base
>>>>> register applies when calculating the pattern. So it would be better
>>>>> to return ~0u on reads or 0u on writes.
>>>>>
>>>
>>>> Yes, but you could easily determine whether the mask is going to be
>>>> invariant with respect to reg_offset (where reg_offset is within bounds)
>>>> and in that case return the periodic_mask() expression above, otherwise
>>>> return 0/~0u depending on whether reg_offset is within bounds.
>>>
>>> Ok, so we are within bounds, we don't have a predicated write, we are
>>> not a send message. Then we have an ALU opcode and we return the
>>> periodic_mask.
>>>
>>
>> Those are all necessary but not sufficient conditions for the
>> periodic_mask() expression above to give you the correct answer for any
>> in-bounds reg_offset > 0, you should check that byte_offset < type_size
>> * stride in addition.
> 
> That's true. Fixed in v5.
> 
> If we don't satisfy the condition then we return 0 on writes and ~0u on
> reads.
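
To make the rule above concrete, here is a minimal sketch of the check
being discussed. It is not the code from the v5 patch: periodic_mask() is a
hypothetical reimplementation that follows the (count, step, bits, offset)
argument order of the quoted snippet, and use_pattern_sketch() and its
parameters are names invented for illustration. Each bit of the result
stands for one byte of a 32-byte GRF.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (an assumption, not Mesa's implementation): sets
// `bits` consecutive bits every `step` bits, `count` times, starting at
// bit `offset`. Anything past bit 31 is dropped, since only one 32-byte
// register is described.
static inline uint32_t
periodic_mask(unsigned count, unsigned step, unsigned bits, unsigned offset)
{
   assert(bits <= 32);
   const uint64_t chunk = (UINT64_C(1) << bits) - 1;
   uint64_t mask = 0;
   for (unsigned i = 0; i < count; i++) {
      const unsigned shift = offset + i * step;
      if (shift < 32)
         mask |= chunk << shift;
   }
   return static_cast<uint32_t>(mask);
}

// Illustrative only: returns the periodic pattern when the extra condition
// from the review (byte_offset < type_size * stride) holds, and otherwise
// the conservative answer: ~0u for reads, 0 for writes.
static inline uint32_t
use_pattern_sketch(unsigned exec_size, unsigned stride, unsigned type_size,
                   unsigned byte_offset, bool is_read)
{
   const uint32_t conservative = is_read ? ~0u : 0u;
   if (byte_offset >= type_size * stride)
      return conservative;
   return periodic_mask(exec_size, stride * type_size, type_size,
                        byte_offset);
}
```

With exec_size 8, a 16-bit type and stride 2, a byte offset of 0 gives
0x33333333, while a byte offset of 4 fails the byte_offset < type_size *
stride test and falls back to the conservative value.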

Could you have a look at v5 to check whether I can count on your R-b?

https://patchwork.freedesktop.org/patch/241482/

I suppose you didn't have time to have a look at the other patch of the
series.

"[v2,2/2] intel/fs: Improve liveness range calculation for partial writes"
https://patchwork.freedesktop.org/patch/239839/

Thanks in advance,

Chema



Re: [Mesa-dev] [PATCH 1/2] intel/fs: New method for register_byte_use_pattern for fs_inst

2018-07-29 Thread Chema Casanova
On 28/07/18 at 01:45, Francisco Jerez wrote:
> Chema Casanova  writes:
>
>> On 27/07/18 at 02:44, Francisco Jerez wrote:
>>> Chema Casanova  writes:
>>>
>>>> On 26/07/18 at 20:02, Francisco Jerez wrote:
>>>>> Chema Casanova  writes:
>>>>>
>>>>>> On 20/07/18 at 22:10, Francisco Jerez wrote:
>>>>>>> Chema Casanova  writes:
>>>>>>>
>>>>>>>> On 20/07/18 at 00:34, Francisco Jerez wrote:
>>>>>>>>> Chema Casanova  writes:
>>>>>>>>>
>>>>>>>>>> On 14/07/18 at 00:14, Francisco Jerez wrote:
>>>>>>>>>>> Jose Maria Casanova Crespo  writes:
>>>>>>>>>>>
>>>>>>>>>>>> For a register source/destination of an instruction the function 
>>>>>>>>>>>> returns
>>>>>>>>>>>> the read/write byte pattern of a 32-byte registers as a unsigned 
>>>>>>>>>>>> int.
>>>>>>>>>>>>
>>>>>>>>>>>> The returned pattern takes into account the exec_size of the 
>>>>>>>>>>>> instruction,
>>>>>>>>>>>> the type bitsize, the stride and if the register is source or 
>>>>>>>>>>>> destination.
>>>>>>>>>>>>
>>>>>>>>>>>> The objective of the functions if to help to know the read/written 
>>>>>>>>>>>> bytes
>>>>>>>>>>>> of the instructions to improve the liveness analysis for partial 
>>>>>>>>>>>> read/writes.
>>>>>>>>>>>>
>>>>>>>>>>>> We manage special cases for 
>>>>>>>>>>>> SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL
>>>>>>>>>>>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because depending of the 
>>>>>>>>>>>> bitsize
>>>>>>>>>>>> parameter they have a different read pattern.
>>>>>>>>>>>> ---
>>>>>>>>>>>>  src/intel/compiler/brw_fs.cpp  | 183 
>>>>>>>>>>>> +
>>>>>>>>>>>>  src/intel/compiler/brw_ir_fs.h |   1 +
>>>>>>>>>>>>  2 files changed, 184 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/src/intel/compiler/brw_fs.cpp 
>>>>>>>>>>>> b/src/intel/compiler/brw_fs.cpp
>>>>>>>>>>>> index 2b8363ca362..f3045c4ff6c 100644
>>>>>>>>>>>> --- a/src/intel/compiler/brw_fs.cpp
>>>>>>>>>>>> +++ b/src/intel/compiler/brw_fs.cpp
>>>>>>>>>>>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const
>>>>>>>>>>>> this->dst.offset % REG_SIZE != 0);
>>>>>>>>>>>>  }
>>>>>>>>>>>>  
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * Returns a 32-bit uint whose bits represent if the associated 
>>>>>>>>>>>> register byte
>>>>>>>>>>>> + * has been read/written by the instruction. The returned pattern 
>>>>>>>>>>>> takes into
>>>>>>>>>>>> + * account the exec_size of the instruction, the type bitsize and 
>>>>>>>>>>>> the register
>>>>>>>>>>>> + * stride and the register is source or destination for the 
>>>>>>>>>>>> instruction.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * The objective of this function is to identify which parts of 
>>>>>>>>>>>> the register
>>>>>>>>>>>> + * are read or written for operations that don't read/write a 
>>>>>>>>>>>> full register.
>>>>>>>>>>>> + * So we can identify in live range variable analysis if a 
>>>>>>>

[Mesa-dev] [PATCH 1/2] intel/fs: New method for register_byte_use_pattern for fs_inst

2018-07-27 Thread Chema Casanova
On 27/07/18 at 02:44, Francisco Jerez wrote:
> Chema Casanova  writes:
>
>> On 26/07/18 at 20:02, Francisco Jerez wrote:
>>> Chema Casanova  writes:
>>>
>>>> On 20/07/18 at 22:10, Francisco Jerez wrote:
>>>>> Chema Casanova  writes:
>>>>>
>>>>>> On 20/07/18 at 00:34, Francisco Jerez wrote:
>>>>>>> Chema Casanova  writes:
>>>>>>>
>>>>>>>> On 14/07/18 at 00:14, Francisco Jerez wrote:
>>>>>>>>> Jose Maria Casanova Crespo  writes:
>>>>>>>>>
>>>>>>>>>> For a register source/destination of an instruction the function 
>>>>>>>>>> returns
>>>>>>>>>> the read/write byte pattern of a 32-byte registers as a unsigned int.
>>>>>>>>>>
>>>>>>>>>> The returned pattern takes into account the exec_size of the 
>>>>>>>>>> instruction,
>>>>>>>>>> the type bitsize, the stride and if the register is source or 
>>>>>>>>>> destination.
>>>>>>>>>>
>>>>>>>>>> The objective of the functions if to help to know the read/written 
>>>>>>>>>> bytes
>>>>>>>>>> of the instructions to improve the liveness analysis for partial 
>>>>>>>>>> read/writes.
>>>>>>>>>>
>>>>>>>>>> We manage special cases for 
>>>>>>>>>> SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL
>>>>>>>>>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because depending of the 
>>>>>>>>>> bitsize
>>>>>>>>>> parameter they have a different read pattern.
>>>>>>>>>> ---
>>>>>>>>>>  src/intel/compiler/brw_fs.cpp  | 183 
>>>>>>>>>> +
>>>>>>>>>>  src/intel/compiler/brw_ir_fs.h |   1 +
>>>>>>>>>>  2 files changed, 184 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/src/intel/compiler/brw_fs.cpp 
>>>>>>>>>> b/src/intel/compiler/brw_fs.cpp
>>>>>>>>>> index 2b8363ca362..f3045c4ff6c 100644
>>>>>>>>>> --- a/src/intel/compiler/brw_fs.cpp
>>>>>>>>>> +++ b/src/intel/compiler/brw_fs.cpp
>>>>>>>>>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const
>>>>>>>>>> this->dst.offset % REG_SIZE != 0);
>>>>>>>>>>  }
>>>>>>>>>>  
>>>>>>>>>> +/**
>>>>>>>>>> + * Returns a 32-bit uint whose bits represent if the associated 
>>>>>>>>>> register byte
>>>>>>>>>> + * has been read/written by the instruction. The returned pattern 
>>>>>>>>>> takes into
>>>>>>>>>> + * account the exec_size of the instruction, the type bitsize and 
>>>>>>>>>> the register
>>>>>>>>>> + * stride and the register is source or destination for the 
>>>>>>>>>> instruction.
>>>>>>>>>> + *
>>>>>>>>>> + * The objective of this function is to identify which parts of the 
>>>>>>>>>> register
>>>>>>>>>> + * are read or written for operations that don't read/write a full 
>>>>>>>>>> register.
>>>>>>>>>> + * So we can identify in live range variable analysis if a partial 
>>>>>>>>>> write has
>>>>>>>>>> + * completelly defined the part of the register used by a partial 
>>>>>>>>>> read. So we
>>>>>>>>>> + * avoid extending the liveness range because all data read was 
>>>>>>>>>> already
>>>>>>>>>> + * defined although the wasn't completely written.
>>>>>>>>>> + */
>>>>>>>>>> +unsigned
>>>>>>>>>> +fs_inst::register_byte_use_pattern(const fs_reg &r, boolean

Re: [Mesa-dev] [PATCH 1/2] intel/fs: New method for register_byte_use_pattern for fs_inst

2018-07-26 Thread Chema Casanova
On 26/07/18 at 20:02, Francisco Jerez wrote:
> Chema Casanova  writes:
>
>> On 20/07/18 at 22:10, Francisco Jerez wrote:
>>> Chema Casanova  writes:
>>>
>>>> On 20/07/18 at 00:34, Francisco Jerez wrote:
>>>>> Chema Casanova  writes:
>>>>>
>>>>>> On 14/07/18 at 00:14, Francisco Jerez wrote:
>>>>>>> Jose Maria Casanova Crespo  writes:
>>>>>>>
>>>>>>>> For a register source/destination of an instruction the function 
>>>>>>>> returns
>>>>>>>> the read/write byte pattern of a 32-byte registers as a unsigned int.
>>>>>>>>
>>>>>>>> The returned pattern takes into account the exec_size of the 
>>>>>>>> instruction,
>>>>>>>> the type bitsize, the stride and if the register is source or 
>>>>>>>> destination.
>>>>>>>>
>>>>>>>> The objective of the functions if to help to know the read/written 
>>>>>>>> bytes
>>>>>>>> of the instructions to improve the liveness analysis for partial 
>>>>>>>> read/writes.
>>>>>>>>
>>>>>>>> We manage special cases for SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL
>>>>>>>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because depending of the bitsize
>>>>>>>> parameter they have a different read pattern.
>>>>>>>> ---
>>>>>>>>  src/intel/compiler/brw_fs.cpp  | 183 +
>>>>>>>>  src/intel/compiler/brw_ir_fs.h |   1 +
>>>>>>>>  2 files changed, 184 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/src/intel/compiler/brw_fs.cpp 
>>>>>>>> b/src/intel/compiler/brw_fs.cpp
>>>>>>>> index 2b8363ca362..f3045c4ff6c 100644
>>>>>>>> --- a/src/intel/compiler/brw_fs.cpp
>>>>>>>> +++ b/src/intel/compiler/brw_fs.cpp
>>>>>>>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const
>>>>>>>> this->dst.offset % REG_SIZE != 0);
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> +/**
>>>>>>>> + * Returns a 32-bit uint whose bits represent if the associated 
>>>>>>>> register byte
>>>>>>>> + * has been read/written by the instruction. The returned pattern 
>>>>>>>> takes into
>>>>>>>> + * account the exec_size of the instruction, the type bitsize and the 
>>>>>>>> register
>>>>>>>> + * stride and the register is source or destination for the 
>>>>>>>> instruction.
>>>>>>>> + *
>>>>>>>> + * The objective of this function is to identify which parts of the 
>>>>>>>> register
>>>>>>>> + * are read or written for operations that don't read/write a full 
>>>>>>>> register.
>>>>>>>> + * So we can identify in live range variable analysis if a partial 
>>>>>>>> write has
>>>>>>>> + * completelly defined the part of the register used by a partial 
>>>>>>>> read. So we
>>>>>>>> + * avoid extending the liveness range because all data read was 
>>>>>>>> already
>>>>>>>> + * defined although the wasn't completely written.
>>>>>>>> + */
>>>>>>>> +unsigned
>>>>>>>> +fs_inst::register_byte_use_pattern(const fs_reg &r, boolean is_dst)
>>>>>>>> const
>>>>>>>> +{
>>>>>>>> +   if (is_dst) {
>>>>>>
>>>>>>> Please split into two functions (like fs_inst::src_read and
>>>>>>> ::src_written) since that would make the call-sites of this method more
>>>>>>> self-documenting than a boolean parameter.  You should be able to share
>>>>>>> code by refactoring the common logic into a separate function (see below
>>>>>>> for some suggestions on how that could be achieved).
>>>>>>
>>>>>> Sure, it would improve readability and simplifies the logic, I've cho

Re: [Mesa-dev] [PATCH] intel/compiler: fix lower conversions to account for predication

2018-07-26 Thread Chema Casanova
Please include:

Fixes: 5a12bdac09496e00 "i965/compiler: handle conversion to smaller
 type in the lowering pass for that"

Reviewed-by: Jose Maria Casanova Crespo 

On 17/07/18 at 11:10, Iago Toral Quiroga wrote:
> The pass can create a temporary result for the instruction and then
> moves from it to the original destination, however, if the original
> instruction was predicated, the mov has to be predicated as well.
> ---
>  src/intel/compiler/brw_fs_lower_conversions.cpp | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/src/intel/compiler/brw_fs_lower_conversions.cpp 
> b/src/intel/compiler/brw_fs_lower_conversions.cpp
> index e27e2402746..145fb55f995 100644
> --- a/src/intel/compiler/brw_fs_lower_conversions.cpp
> +++ b/src/intel/compiler/brw_fs_lower_conversions.cpp
> @@ -98,7 +98,10 @@ fs_visitor::lower_conversions()
>   * size_written accordingly.
>   */
>  inst->size_written = inst->dst.component_size(inst->exec_size);
> -ibld.at(block, inst->next).MOV(dst, strided_temp)->saturate = 
> saturate;
> +
> +fs_inst *mov = ibld.at(block, inst->next).MOV(dst, strided_temp);
> +mov->saturate = saturate;
> +mov->predicate = inst->predicate;
>  
>  progress = true;
>   }
> 


Re: [Mesa-dev] [PATCH mesa] anv: don't crash on vkDestroyDevice(NULL)

2018-07-25 Thread Chema Casanova
Reviewed-by: Jose Maria Casanova Crespo 

On 25/07/18 at 21:25, Eric Engestrom wrote:
> On Wednesday, 2018-07-25 19:45:56 +0100, Eric Engestrom wrote:
>> CovID: 1438132
>> Signed-off-by: Eric Engestrom 
> 
> Forgot to check before sending:
> 
> Fixes: a99c9e63a07477634ab73 "anv: finish the binding_table_pool on
>   destroyDevice when use_softpin"
> Cc: Jose Maria Casanova Crespo 
> 
>> ---
>>  src/intel/vulkan/anv_device.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/src/intel/vulkan/anv_device.c b/src/intel/vulkan/anv_device.c
>> index 04fd6a829ed60081abc4..3664f80c24dc34955196 100644
>> --- a/src/intel/vulkan/anv_device.c
>> +++ b/src/intel/vulkan/anv_device.c
>> @@ -1832,11 +1832,13 @@ void anv_DestroyDevice(
>>  const VkAllocationCallbacks*pAllocator)
>>  {
>> ANV_FROM_HANDLE(anv_device, device, _device);
>> -   struct anv_physical_device *physical_device = &device->instance->physicalDevice;
>> +   struct anv_physical_device *physical_device;
>>  
>> if (!device)
>>return;
>>  
>> +   physical_device = &device->instance->physicalDevice;
>> +
>> anv_device_finish_blorp(device);
>>  
>> anv_pipeline_cache_finish(&device->default_pipeline_cache);
>> -- 
>> Cheers,
>>   Eric
>>
> 


Re: [Mesa-dev] [PATCH 1/2] intel/fs: New method for register_byte_use_pattern for fs_inst

2018-07-23 Thread Chema Casanova
On 20/07/18 at 22:10, Francisco Jerez wrote:
> Chema Casanova  writes:
>
>> On 20/07/18 at 00:34, Francisco Jerez wrote:
>>> Chema Casanova  writes:
>>>
>>>> On 14/07/18 at 00:14, Francisco Jerez wrote:
>>>>> Jose Maria Casanova Crespo  writes:
>>>>>
>>>>>> For a register source/destination of an instruction the function returns
>>>>>> the read/write byte pattern of a 32-byte registers as a unsigned int.
>>>>>>
>>>>>> The returned pattern takes into account the exec_size of the instruction,
>>>>>> the type bitsize, the stride and if the register is source or 
>>>>>> destination.
>>>>>>
>>>>>> The objective of the functions if to help to know the read/written bytes
>>>>>> of the instructions to improve the liveness analysis for partial 
>>>>>> read/writes.
>>>>>>
>>>>>> We manage special cases for SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL
>>>>>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because depending of the bitsize
>>>>>> parameter they have a different read pattern.
>>>>>> ---
>>>>>>  src/intel/compiler/brw_fs.cpp  | 183 +
>>>>>>  src/intel/compiler/brw_ir_fs.h |   1 +
>>>>>>  2 files changed, 184 insertions(+)
>>>>>>
>>>>>> diff --git a/src/intel/compiler/brw_fs.cpp 
>>>>>> b/src/intel/compiler/brw_fs.cpp
>>>>>> index 2b8363ca362..f3045c4ff6c 100644
>>>>>> --- a/src/intel/compiler/brw_fs.cpp
>>>>>> +++ b/src/intel/compiler/brw_fs.cpp
>>>>>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const
>>>>>> this->dst.offset % REG_SIZE != 0);
>>>>>>  }
>>>>>>  
>>>>>> +/**
>>>>>> + * Returns a 32-bit uint whose bits represent if the associated 
>>>>>> register byte
>>>>>> + * has been read/written by the instruction. The returned pattern takes 
>>>>>> into
>>>>>> + * account the exec_size of the instruction, the type bitsize and the 
>>>>>> register
>>>>>> + * stride and the register is source or destination for the instruction.
>>>>>> + *
>>>>>> + * The objective of this function is to identify which parts of the 
>>>>>> register
>>>>>> + * are read or written for operations that don't read/write a full 
>>>>>> register.
>>>>>> + * So we can identify in live range variable analysis if a partial 
>>>>>> write has
>>>>>> + * completelly defined the part of the register used by a partial read. 
>>>>>> So we
>>>>>> + * avoid extending the liveness range because all data read was already
>>>>>> + * defined although the wasn't completely written.
>>>>>> + */
>>>>>> +unsigned
>>>>>> +fs_inst::register_byte_use_pattern(const fs_reg &r, boolean is_dst)
>>>>>> const
>>>>>> +{
>>>>>> +   if (is_dst) {
>>>>
>>>>> Please split into two functions (like fs_inst::src_read and
>>>>> ::src_written) since that would make the call-sites of this method more
>>>>> self-documenting than a boolean parameter.  You should be able to share
>>>>> code by refactoring the common logic into a separate function (see below
>>>>> for some suggestions on how that could be achieved).
>>>>
>>>> Sure, it would improve readability and simplifies the logic, I've chosen
>>>> dst_write_pattern and src_read_pattern.
>>>>
>>>>>
>>>>>> +  /* We don't know what is written so we return the worts case */
>>>>>
>>>>> "worst"
>>>>
>>>> Fixed.
>>>>
>>>>>> +  if (this->predicate && this->opcode != BRW_OPCODE_SEL)
>>>>>> + return 0;
>>>>>> +  /* We assume that send destinations are completely written */
>>>>>> +  if (this->is_send_from_grf())
>>>>>> + return ~0u;
>>>>>
>>>>> Some send-like instructions won't be caught by this condition, you
>>>>> should check for this-

Re: [Mesa-dev] [PATCH 1/2] intel/fs: New method for register_byte_use_pattern for fs_inst

2018-07-20 Thread Chema Casanova
On 20/07/18 at 00:34, Francisco Jerez wrote:
> Chema Casanova  writes:
>
>> On 14/07/18 at 00:14, Francisco Jerez wrote:
>>> Jose Maria Casanova Crespo  writes:
>>>
>>>> For a register source/destination of an instruction the function returns
>>>> the read/write byte pattern of a 32-byte registers as a unsigned int.
>>>>
>>>> The returned pattern takes into account the exec_size of the instruction,
>>>> the type bitsize, the stride and if the register is source or destination.
>>>>
>>>> The objective of the functions if to help to know the read/written bytes
>>>> of the instructions to improve the liveness analysis for partial 
>>>> read/writes.
>>>>
>>>> We manage special cases for SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL
>>>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because depending of the bitsize
>>>> parameter they have a different read pattern.
>>>> ---
>>>>  src/intel/compiler/brw_fs.cpp  | 183 +
>>>>  src/intel/compiler/brw_ir_fs.h |   1 +
>>>>  2 files changed, 184 insertions(+)
>>>>
>>>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
>>>> index 2b8363ca362..f3045c4ff6c 100644
>>>> --- a/src/intel/compiler/brw_fs.cpp
>>>> +++ b/src/intel/compiler/brw_fs.cpp
>>>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const
>>>> this->dst.offset % REG_SIZE != 0);
>>>>  }
>>>>  
>>>> +/**
>>>> + * Returns a 32-bit uint whose bits represent if the associated register 
>>>> byte
>>>> + * has been read/written by the instruction. The returned pattern takes 
>>>> into
>>>> + * account the exec_size of the instruction, the type bitsize and the 
>>>> register
>>>> + * stride and the register is source or destination for the instruction.
>>>> + *
>>>> + * The objective of this function is to identify which parts of the 
>>>> register
>>>> + * are read or written for operations that don't read/write a full 
>>>> register.
>>>> + * So we can identify in live range variable analysis if a partial write 
>>>> has
>>>> + * completelly defined the part of the register used by a partial read. 
>>>> So we
>>>> + * avoid extending the liveness range because all data read was already
>>>> + * defined although the wasn't completely written.
>>>> + */
>>>> +unsigned
>>>> +fs_inst::register_byte_use_pattern(const fs_reg &r, boolean is_dst) const
>>>> +{
>>>> +   if (is_dst) {
>>
>>> Please split into two functions (like fs_inst::src_read and
>>> ::src_written) since that would make the call-sites of this method more
>>> self-documenting than a boolean parameter.  You should be able to share
>>> code by refactoring the common logic into a separate function (see below
>>> for some suggestions on how that could be achieved).
>>
>> Sure, it would improve readability and simplifies the logic, I've chosen
>> dst_write_pattern and src_read_pattern.
>>
>>>
>>>> +  /* We don't know what is written so we return the worts case */
>>>
>>> "worst"
>>
>> Fixed.
>>
>>>> +  if (this->predicate && this->opcode != BRW_OPCODE_SEL)
>>>> + return 0;
>>>> +  /* We assume that send destinations are completely written */
>>>> +  if (this->is_send_from_grf())
>>>> + return ~0u;
>>>
>>> Some send-like instructions won't be caught by this condition, you
>>> should check for this->mlen != 0 in addition.
>>
>> Would it be enough to check for (this->mlen > 0) and forget about
>> is_send_from_grf? I am using this approach in v2 I am sending.
>>
> 
> I don't think the mlen > 0 condition would catch all cases either...
> E.g. FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD IIRC.  You probably need both
> conditions.  Sucks...

That is true, so now we have:
 (this->is_send_from_grf() || this->mlen != 0)
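
As a rough illustration of how the destination-side early-outs discussed in
this thread fit together, here is a minimal sketch. The inst_sketch struct
and dst_write_early_out() are hypothetical stand-ins invented for this
example; they only model the few fs_inst properties mentioned in the review
and are not the code from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the fs_inst properties used in this thread.
struct inst_sketch {
   bool predicated;      // models: this->predicate is set
   bool is_sel;          // models: this->opcode == BRW_OPCODE_SEL
   bool send_from_grf;   // models: this->is_send_from_grf()
   unsigned mlen;        // models: this->mlen
};

// Returns true and stores the write pattern in *mask when one of the
// early-outs applies; returns false when the caller would go on to compute
// a pattern for a regular ALU instruction.
static inline bool
dst_write_early_out(const inst_sketch &inst, uint32_t *mask)
{
   // Predicated writes (other than SEL): we don't know what is written,
   // so report the worst case for a write, i.e. nothing written.
   if (inst.predicated && !inst.is_sel) {
      *mask = 0u;
      return true;
   }

   // Send-like instructions: both conditions are needed, and the
   // destination is assumed to be completely written.
   if (inst.send_from_grf || inst.mlen != 0) {
      *mask = ~0u;
      return true;
   }

   return false;
}
```

Checking both is_send_from_grf() and mlen follows the review comment that
neither condition alone catches every send-like instruction.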

>>>> +   } else {
>>>> +  /* byte_scattered_write_logical pattern of src[1] is 32-bit aligned
>>>> +   * so the read pattern depends on the bitsize stored at src[4]
>>>> +   */
>>>> +  if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL 

[Mesa-dev] [PATCH 1/2] intel/fs: New method for register_byte_use_pattern for fs_inst

2018-07-19 Thread Chema Casanova
On 14/07/18 at 00:14, Francisco Jerez wrote:
> Jose Maria Casanova Crespo  writes:
> 
>> For a register source/destination of an instruction the function returns
>> the read/write byte pattern of a 32-byte register as an unsigned int.
>>
>> The returned pattern takes into account the exec_size of the instruction,
>> the type bitsize, the stride and whether the register is a source or a
>> destination.
>>
>> The objective of the function is to help determine the read/written bytes
>> of the instructions, to improve the liveness analysis for partial
>> read/writes.
>>
>> We manage special cases for SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL
>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because, depending on the bitsize
>> parameter, they have a different read pattern.
>> ---
>>  src/intel/compiler/brw_fs.cpp  | 183 +
>>  src/intel/compiler/brw_ir_fs.h |   1 +
>>  2 files changed, 184 insertions(+)
>>
>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
>> index 2b8363ca362..f3045c4ff6c 100644
>> --- a/src/intel/compiler/brw_fs.cpp
>> +++ b/src/intel/compiler/brw_fs.cpp
>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const
>> this->dst.offset % REG_SIZE != 0);
>>  }
>>  
>> +/**
>> + * Returns a 32-bit uint whose bits represent if the associated register 
>> byte
>> + * has been read/written by the instruction. The returned pattern takes into
>> + * account the exec_size of the instruction, the type bitsize and the 
>> register
>> + * stride and the register is source or destination for the instruction.
>> + *
>> + * The objective of this function is to identify which parts of the register
>> + * are read or written for operations that don't read/write a full register.
>> + * So we can identify in live range variable analysis if a partial write has
>> + * completelly defined the part of the register used by a partial read. So 
>> we
>> + * avoid extending the liveness range because all data read was already
>> + * defined although the wasn't completely written.
>> + */
>> +unsigned
>> +fs_inst::register_byte_use_pattern(const fs_reg &r, boolean is_dst) const
>> +{
>> +   if (is_dst) {

> Please split into two functions (like fs_inst::src_read and
> ::src_written) since that would make the call-sites of this method more
> self-documenting than a boolean parameter.  You should be able to share
> code by refactoring the common logic into a separate function (see below
> for some suggestions on how that could be achieved).

Sure, it would improve readability and simplify the logic. I've chosen
dst_write_pattern and src_read_pattern.

> 
>> +  /* We don't know what is written so we return the worts case */
> 
> "worst"

Fixed.

>> +  if (this->predicate && this->opcode != BRW_OPCODE_SEL)
>> + return 0;
>> +  /* We assume that send destinations are completely written */
>> +  if (this->is_send_from_grf())
>> + return ~0u;
> 
> Some send-like instructions won't be caught by this condition, you
> should check for this->mlen != 0 in addition.

Would it be enough to check for (this->mlen > 0) and forget about
is_send_from_grf? I am using this approach in the v2 I am sending.

>> +   } else {
>> +  /* byte_scattered_write_logical pattern of src[1] is 32-bit aligned
>> +   * so the read pattern depends on the bitsize stored at src[4]
>> +   */
>> +  if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL &&
>> +  this->src[1].nr == r.nr) {

> I feel uncomfortable about attempting to guess the source the caller is
> referring to by comparing the registers for equality.  E.g.  you could
> potentially end up with two sources that compare equal but have
> different semantics (e.g. as a result of CSE) which might cause it to
> get the wrong answer.  It would probably be better to pass a source
> index and a byte offset as argument instead of an fs_reg.

I hadn't thought about CSE. The function now receives the source index
and the reg_offset. I'm using reg_offset instead of a byte offset as it
simplifies the logic. Now we always use the base src register to do all
the calculations.
>> + switch (this->src[4].ud) {
>> + case 32:
>> +return ~0u;
>> + case 16:
>> +return 0x;
>> + case 8:
>> +return 0x;
>> + default:
>> +unreachable("Unsupported bitsize at 
>> byte_scattered_write_logical");
>> + }
> 
> Replace the above switch statement with a call to "periodic_mask(8, 4,
> this->src[4].ud / 8)" (see below for the definition).

Ok.
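
For reference, a minimal sketch of what a three-argument periodic_mask()
helper could look like, so the suggested call can be checked by hand. The
definition below is an assumption written for this example (the definition
Francisco refers to is not part of this excerpt); it takes (count, step,
bits) in that order and each result bit stands for one register byte.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not taken from the patch: sets `bits` consecutive
// bits at the start of each `step`-bit period, for `count` periods.
static inline uint32_t
periodic_mask(unsigned count, unsigned step, unsigned bits)
{
   assert(bits >= 1 && bits <= step && count * step <= 32);
   const uint32_t chunk = bits >= 32 ? ~0u : (1u << bits) - 1;
   uint32_t mask = 0;
   for (unsigned i = 0; i < count; i++)
      mask |= chunk << (i * step);
   return mask;
}

// Illustrative wrapper: read pattern of the 32-bit aligned data source of a
// byte scattered write, given the bit size stored in src[4].
static inline uint32_t
byte_scattered_read_pattern(unsigned bit_size)
{
   return periodic_mask(8, 4, bit_size / 8);
}
```

Under these assumptions, bit sizes 32, 16 and 8 evaluate to 0xffffffff,
0x33333333 and 0x11111111 respectively: every 4-byte slot is read in full,
or only its low 2 bytes, or only its low byte.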

>> +  }
>> +  /* As for byte_scattered_write_logical but we need to take into 
>> account
>> +   * that data written are in the payload offset 32 with SIMD8 and 
>> offset
>> +   * 64 with SIMD16.
>> +   */
>> +  if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE &&
>> +  this->src[0].nr == r.nr) {
>> + fs_reg payload = 

Re: [Mesa-dev] [PATCH] i965/fs: unspills shoudn't use grf127 as dest since Gen8+

2018-07-12 Thread Chema Casanova
On 12/07/18 at 03:23, Caio Marcelo de Oliveira Filho wrote:
> On Wed, Jul 11, 2018 at 06:03:05PM +0200, Jose Maria Casanova Crespo wrote:
>> At 232ed8980217dd65ab0925df28156f565b94b2e5 "i965/fs: Register allocator
>> shoudn't use grf127 for sends dest" we didn't take into account the case
>> of SEND instructions that are not send_from_grf. But since Gen7+ although
>> the backend still uses MRFs internally for sends they are finally asigned
>> to a GRFs.
> 
> Typo "assigned".

Fixed.

>> In the case of unspills the backend assigns directly as source its
>> destination because it is suppose to be available. So we always have a
>> source-destination overlap. If the reg_allocator asigns registers that
> 
> Typo "assigns".

Fixed.

>> include de grf127 we fail the validation rule that affects Gen8+
> 
> Typo "the".

Fixed.

>> "r127 must not be used for return address when there is a src and dest
>> overlap in send instruction."
>>
>> So this patch activates the grf127_send_hack_node for Gen8+ and if we have
>> any register spilled we add interferences to the destination of the unspill
>> operations.
> 
> I've spent some time testing why this patch was still not covering all
> the cases yet. The opt_bank_conflicts() optimization, that runs after
> the register allocation, was moving things around, causing the r127 to
> be used in the condition we were avoiding it.
> 
> The code there already has the idea of not touching certain registers,
> so we should add something like
> 
>   /* At Intel Broadwell PRM, vol 07, section "Instruction Set Reference",
>* subsection "EUISA Instructions", Send Message (page 990):
>*
>* "r127 must not be used for return address when there is a src and
>* dest overlap in send instruction."
>*
>* Register allocation ensures that, so don't move 127 around to avoid
>* breaking that property.
>*/ 
>   if (v->devinfo->gen >= 8)
>  constrained[p.atom_of_reg(127)] = true;
> 
> to function shader_reg_constraints() in
> brw_fs_bank_conflicts.cpp. This fixes the crashes I was seeing in
> shader-db.
> 
> With the change to bank conflicts and the typos/style fixed, this
> patch is


Good find. I like the clean and simple solution. At the point where bank
conflicts are optimized I don't see a better way to avoid messing with
grf127, although we are forbidding legal permutations when no SEND
instructions are involved. I've just put your code after the
constraints for reg0 and reg1.

I've also confirmed that a full shader-db run completes without
crashes caused by this validation rule, and the performance impact of the
patch doesn't seem too large, taking into account that we are
avoiding generating instructions with undefined return values.

total instructions in shared programs: 14867211 -> 14867218 (<.01%)
instructions in affected programs: 5314 -> 5321 (0.13%)
helped: 1
HURT: 1

total cycles in shared programs: 537925161 -> 537923248 (<.01%)
cycles in affected programs: 44939136 -> 44937223 (<.01%)
helped: 10
HURT: 23

total spills in shared programs: 7789 -> 7790 (0.01%)
spills in affected programs: 107 -> 108 (0.93%)
helped: 0
HURT: 1

total fills in shared programs: 10555 -> 10557 (0.02%)
fills in affected programs: 155 -> 157 (1.29%)
helped: 0
HURT: 1

> Reviewed-by: Caio Marcelo de Oliveira Filho 

Thanks for the review.

Chema

> 
> Reviewed-by: Caio Marcelo de Oliveira Filho 
> 
> 
>> +  if (spilled_any_registers) {
>> + foreach_block_and_inst(block, fs_inst, inst, cfg) {
>> +if ((inst->opcode == SHADER_OPCODE_GEN7_SCRATCH_READ ||
>> +inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_READ) &&
>> +inst->dst.file ==VGRF) {
> 
> Missing space after the "==".
> 
>> +   ra_add_node_interference(g, inst->dst.nr, 
>> grf127_send_hack_node);
>> +}
>>   }
>>}
>> }
>>  
>> +
> 
> Extra newline?
> 
> 
> Thanks,
> Caio
> 


Re: [Mesa-dev] [RFC] i965/fs: Generalize grf127 hack to dispatch_width > 8

2018-07-11 Thread Chema Casanova
On 11/07/18 at 03:50, Caio Marcelo de Oliveira Filho wrote:
> Change the hack to always apply, adjusting the register number
> according to the dispatch_width.
> 
> The original change assumed that given for dispatch_width > 8 we
> already prevent the overlap of source and destination for send, it
> would not be necessary to explicitly add an interference with a
> register that covers r127.
> 
> The problem is that the code for spilling registers ends up generating
> scratch reads, that in Gen7+ will reuse the destination register,
> causing a send with both source and destination overlapping. So prevent
> r127 (or the overlapping wider register) to be used as destination for
> sends.
> 
> This patch fixes piglit test
> tests/spec/arb_compute_shader/linker/bug-93840.shader_test.
> 
> Fixes: 232ed898021 "i965/fs: Register allocator shoudn't use grf127 for sends 
> dest"
> ---
> 
> After more digging on the piglit failure, I came up with this
> patch. I'm still seeing crashes for some shader-db executions
> (master has them too), but didn't have time today to drill into them
> 
>  src/intel/compiler/brw_fs_reg_allocate.cpp | 11 ---
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs_reg_allocate.cpp 
> b/src/intel/compiler/brw_fs_reg_allocate.cpp
> index 59e047483c0..417ddeba09c 100644
> --- a/src/intel/compiler/brw_fs_reg_allocate.cpp
> +++ b/src/intel/compiler/brw_fs_reg_allocate.cpp
> @@ -549,7 +549,7 @@ fs_visitor::assign_regs(bool allow_spilling, bool 
> spill_all)
> if (devinfo->gen >= 7)
>node_count += BRW_MAX_GRF - GEN7_MRF_HACK_START;
> int grf127_send_hack_node = node_count;
> -   if (devinfo->gen >= 8 && dispatch_width == 8)
> +   if (devinfo->gen >= 8 && dispatch_width >= 8)
>node_count ++;
> struct ra_graph *g =
>ra_alloc_interference_graph(compiler->fs_reg_sets[rsi].regs, 
> node_count);
> @@ -656,7 +656,7 @@ fs_visitor::assign_regs(bool allow_spilling, bool 
> spill_all)
>}
> }
>  
> -   if (devinfo->gen >= 8 && dispatch_width == 8) {
> +   if (devinfo->gen >= 8 && dispatch_width >= 8) {
>/* At Intel Broadwell PRM, vol 07, section "Instruction Set Reference",
> * subsection "EUISA Instructions", Send Message (page 990):
> *
> @@ -665,12 +665,9 @@ fs_visitor::assign_regs(bool allow_spilling, bool 
> spill_all)
> *
> * We are avoiding using grf127 as part of the destination of send
> * messages adding a node interference to the grf127_send_hack_node.
> -   * This node has a fixed asignment to grf127.
> -   *
> -   * We don't apply it to SIMD16 because previous code avoids any 
> register
> -   * overlap between sources and destination.
> +   * This node has a fixed assignment that overlaps with grf127.
> */
> -  ra_set_node_reg(g, grf127_send_hack_node, 127);
> +  ra_set_node_reg(g, grf127_send_hack_node, 128 - reg_width);

This configuration is more restrictive than needed. The original code
just avoids any register of any length that uses the physical register
grf127. Your code works for SIMD16, but as you are setting conflicts
with grf126 in SIMD16, you are forbidding the use of grf125 with
regsize=2, and likewise grf123 with size 4, even though these options
never use grf127. You don't need to take reg_width into account here,
only which physical register you cannot use.

In brw_alloc_reg_set() you can check how the registers are defined
using different classes for the different sizes. It also configures
the conflicts between the registers of different sizes and the physical
registers.

So if at this point you create a node assigned to a physical register
you have conflicts with all the logical registers with any size that
overlap with it.
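
As a rough illustration of the point above, here is a minimal sketch (not
Mesa code) that models registers as ranges of physical GRFs. The names
GrfRange, ranges_overlap, hack_node_physical and hack_node_for_width are
hypothetical, and it assumes reg_width is counted in GRFs (2 for SIMD16).
It only shows which candidate registers each choice of hack node would
forbid; it does not reproduce the register classes built by
brw_alloc_reg_set().

```cpp
#include <cassert>

// Hypothetical model: a register allocation covering GRFs [base, base + size).
struct GrfRange {
   int base;
   int size;
};

// Two ranges conflict when they share at least one physical GRF.
static bool
ranges_overlap(GrfRange a, GrfRange b)
{
   return a.base < b.base + b.size && b.base < a.base + a.size;
}

// Hack node pinned to the single physical register grf127.
static GrfRange
hack_node_physical()
{
   return GrfRange{127, 1};
}

// Hack node pinned to "128 - reg_width" and spanning reg_width GRFs,
// which is how this sketch models the RFC patch (an assumption).
static GrfRange
hack_node_for_width(int reg_width)
{
   return GrfRange{128 - reg_width, reg_width};
}
```

In this model a size-2 register at grf125 or a size-4 register at grf123
conflicts with the width-based node but not with the node pinned to
grf127, while any register that really covers grf127 conflicts with both.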

>foreach_block_and_inst(block, fs_inst, inst, cfg) {
>   if (inst->is_send_from_grf() && inst->dst.file == VGRF) {
>  ra_add_node_interference(g, inst->dst.nr, grf127_send_hack_node);
> 

The issue here is that the unspill instructions aren't covered by
is_send_from_grf(). I thought we could update is_send_from_grf() to
include the scratch read/write operations, but in the end I think that
doesn't make sense because the source at this point is an MRF that will
eventually be assigned to a GRF on Gen7+.

I've sent a patch with my solution, which I think solves the unspill
case that is creating this problem, but maybe we need to consider whether
there are more SEND instructions that could have this problem because
they use an MRF as source.

Chema



Re: [Mesa-dev] [PATCH 2/9] i965/fs: Register allocator shoudn't use grf127 for sends dest

2018-07-11 Thread Chema Casanova
Including mesa-dev in my previous reply.

On 11/07/18 at 01:08, Caio Marcelo de Oliveira Filho wrote:
>> Since Gen8+ Intel PRM states that "r127 must not be used for return
>> address when there is a src and dest overlap in send instruction."
> 
> The previous patch, that verifies the condition above is causing
> 
> tests/spec/arb_compute_shader/linker/bug-93840.shader_test
> 
> to crash with
> 
> shader_runner: ../src/intel/compiler/brw_fs_generator.cpp:2455: int 
> fs_generator::generate_code(const cfg_t*, int): Assertion `validated' failed.
> 
> I also could reproduce the crash locally. It happens even with this
> patch (which adds the hack) applied.

I've seen it in Jenkins, but couldn't reproduce it so I thought it
wasn't related. Now I've realized that I was using a release build at
that moment.

The good thing is that the validator rule has detected that the
generated instruction was incorrect.


>> This patch implements this restriction creating new grf127_send_hack_node
>> at the register allocator. This node has a fixed assignation to grf127.
>>
>> For vgrf that are used as destination of send messages we create node
>> interferences with the grf127_send_hack_node. So the register allocator
>> will never assign to these vgrf a register that involves grf127.
>>
>> If dispatch_width > 8 we don't create these interferences because
>> all instructions have node interferences between sources and destination.
>> That is enough to avoid the r127 restriction.
> 
> I think for both widths will not be enough. The instruction that fails
> the validation is:
> 
> mov(8)   g126<1>UD    g0<8,8,1>UD    { align1 WE_all 1Q };
> mov(1)   g126.2<1>UD  0x0090UD       { align1 WE_all 1N };
> send(16) g126<1>UW    g126<8,8,1>UD
>          data ( DC OWORD block read, 253, 3) mlen 1 rlen 2 { align1 WE_all 1H };
> ERROR: r127 must not be used for return address when there is a src and dest overlap
> 
> Which if I understood correctly comes from the scratch reading being
> created by the spilling logic. In brw_oword_block_read_scratch() we
> see
> 
>if (p->devinfo->gen >= 7) {
>   /* On gen 7 and above, we no longer have message registers and we can
>* send from any register we want.  By using the destination register
>* for the message, we guarantee that the implied message write won't
>* accidentally overwrite anything.  This has been a problem because
>* the MRF registers and source for the final FB write are both fixed
>* and may overlap.
>*/
>   mrf = retype(dest, BRW_REGISTER_TYPE_UD);
>} else {
>   mrf = retype(mrf, BRW_REGISTER_TYPE_UD);
>}
>dest = retype(dest, BRW_REGISTER_TYPE_UW);
> 
> It seems to me we'll have to handle r127 there as well.

Yes, as in this case source and destination are coded to be the same
vgrf, we don't have a source/destination interference on SIMD16.

I'm doing some extra testing, but something like the following code in
assign_regs() seems to fix the issue:


   if (spilled_any_registers) {
      foreach_block_and_inst(block, fs_inst, inst, cfg) {
         if (inst->opcode == SHADER_OPCODE_GEN7_SCRATCH_READ ||
             inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_READ) {
            ra_add_node_interference(g, inst->dst.nr,
                                     grf127_send_hack_node);
         }
      }
   }

Thanks Caio for digging into the problem. I'm sending a patch today to
deal with this case.

Chema


>>
>> This fixes CTS tests that raised this issue as they were executed as SIMD8:
>>
>> dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.8struct_to_32struct.storage_buffer_*int_geom
>>
>> Shader-db results on Skylake:
>>total instructions in shared programs: 7686798 -> 7686797 (<.01%)
>>instructions in affected programs: 301 -> 300 (-0.33%)
>>helped: 1
>>HURT: 0
>>
>>total cycles in shared programs: 337092322 -> 337091919 (<.01%)
>>cycles in affected programs: 22420415 -> 22420012 (<.01%)
>>helped: 712
>>HURT: 588
>>
>> Shader-db results on Broadwell:
>>
>>total instructions in shared programs: 7658574 -> 7658625 (<.01%)
>>instructions in affected programs: 19610 -> 19661 (0.26%)
>>helped: 3
>>HURT: 4
>>
>>total cycles in shared programs: 340694553 -> 340676378 (<.01%)
>>cycles in affected programs: 24724915 -> 24706740 (-0.07%)
>>helped: 998
>>HURT: 916
>>
>>total spills in shared programs: 4300 -> 4311 (0.26%)
>>spills in affected programs: 333 -> 344 (3.30%)
>>helped: 1
>>HURT: 3
>>
>>total fills in shared programs: 5370 -> 5378 (0.15%)
>>fills in affected programs: 274 -> 282 (2.92%)
>>helped: 1
>>HURT: 3
>>
>> v2: Avoid duplicating register classes without grf127. Let's use a node
>> with a fixed assignation to grf127 and create interferences to send
>> message vgrf destinations. (Eric Anholt)
>> v3: 

Re: [Mesa-dev] [PATCH 01/14] intel/fs: general 8/16/32/64-bit shuffle_src_to_dst function (v2)

2018-06-15 Thread Chema Casanova
On 15/06/18 06:50, Jason Ekstrand wrote:
> On Thu, Jun 14, 2018 at 6:06 PM, Jose Maria Casanova Crespo
> <jmcasan...@igalia.com> wrote:
> 
> This new function takes care of shuffle/unshuffle components of a
> particular bit-size in components with a different bit-size.
> 
> If source type size is smaller than destination type size the operation
> needed is a component shuffle. The opposite case would be an unshuffle.
> 
> Component units are measured in terms of the smaller type between
> source and destination. As we are un/shuffling the smaller components
> from/into a bigger one.
> 
> The operation allows to skip first_component number of components from
> the source.
> 
> Shuffle MOVs are retyped using integer types avoiding problems with
> denorms and float types if source and destination bitsize is different.
> This allows to simplify uses of shuffle functions that are dealing with
> these retypes individually.
> 
> Now there is a new restriction so source and destination can not overlap
> anymore when calling this shuffle function. Following patches that
> migrate
> to use this new function will take care individually of avoiding source
> and destination overlaps.
> 
> v2: (Jason Ekstrand)
>     - Rewrite overlap asserts.
>     - Manage type_sz(src.type) == type_sz(dst.type) case using MOVs
>       from source to dest. This works for 64-bit to 64-bits
>       operation that on Gen7 as it doesn't support Q registers.
>     - Explain that components units are based in the smallest type.
> 
> Cc: Jason Ekstrand <ja...@jlekstrand.net>
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 100 ++
>  1 file changed, 100 insertions(+)
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 166da0aa6d7..9c5afc9c46f 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -5362,6 +5362,106 @@ shuffle_16bit_data_for_32bit_write(const
> fs_builder ,
>     }
>  }
> 
> +/*
> + * This helper takes a source register and un/shuffles it into the
> destination
> + * register.
> + *
> + * If source type size is smaller than destination type size the
> operation
> + * needed is a component shuffle. The opposite case would be an
> unshuffle. If
> + * source/destination type size is equal a shuffle is done that
> would be
> + * equivalent to a simple MOV.
> + *
> + * For example, if source is a 16-bit type and destination is
> 32-bit. A 3
> + * components .xyz 16-bit vector on SIMD8 would be.
> + *
> + *    |x1|x2|x3|x4|x5|x6|x7|x8|y1|y2|y3|y4|y5|y6|y7|y8|
> + *    |z1|z2|z3|z4|z5|z6|z7|z8|  |  |  |  |  |  |  |  |
> + *
> + * This helper will return the following 2 32-bit components with
> the 16-bit
> + * values shuffled:
> + *
> + *    |x1 y1|x2 y2|x3 y3|x4 y4|x5 y5|x6 y6|x7 y7|x8 y8|
> + *    |z1   |z2   |z3   |z4   |z5   |z6   |z7   |z8   |
> + *
> + * For unshuffle, the example would be the opposite, a 64-bit type
> source
> + * and a 32-bit destination. A 2 component .xy 64-bit vector on SIMD8
> + * would be:
> + *
> + *    | x1l   x1h | x2l   x2h | x3l   x3h | x4l   x4h |
> + *    | x5l   x5h | x6l   x6h | x7l   x7h | x8l   x8h |
> + *    | y1l   y1h | y2l   y2h | y3l   y3h | y4l   y4h |
> + *    | y5l   y5h | y6l   y6h | y7l   y7h | y8l   y8h |
> + *
> + * The returned result would be the following 4 32-bit components
> unshuffled:
> + *
> + *    | x1l | x2l | x3l | x4l | x5l | x6l | x7l | x8l |
> + *    | x1h | x2h | x3h | x4h | x5h | x6h | x7h | x8h |
> + *    | y1l | y2l | y3l | y4l | y5l | y6l | y7l | y8l |
> + *    | y1h | y2h | y3h | y4h | y5h | y6h | y7h | y8h |
> + *
> + * - Source and destination register must not be overlapped.
> + * - components units are measured in terms of the smaller type between
> + *   source and destination because we are un/shuffling the smaller
> + *   components from/into the bigger ones.
> + * - first_component parameter allows skipping source components.
> + */
> +void
> +shuffle_src_to_dst(const fs_builder &bld,
> +                   const fs_reg &dst,
> +                   const fs_reg &src,
> +                   uint32_t first_component,
> +                   uint32_t components)
> +{
> +   if (type_sz(src.type) == type_sz(dst.type)) {
> +      assert(!regions_overlap(dst,
> +         type_sz(dst.type) * bld.dispatch_width() * components,
> +         offset(src, bld, first_component),
> +         type_sz(src.type) * bld.dispatch_width() * components));
> +      for (unsigned i = 0; i < components; i++) {
> +         bld.MOV(retype(offset(dst, bld, i), src.type),
> +  

Re: [Mesa-dev] [PATCH 09/14] intel/compiler: Use shuffle_from_32bit_read at VS load_input

2018-06-14 Thread Chema Casanova
I forgot to Cc: the mailing list.

On 15/06/18 01:54, Chema Casanova wrote:
> On 14/06/18 03:36, Jason Ekstrand wrote:
>> On Sat, Jun 9, 2018 at 4:13 AM, Jose Maria Casanova Crespo
>> <jmcasan...@igalia.com> wrote:
>>
>> shuffle_from_32bit_read manages 32-bit reads to 32-bit destination
>> in the same way that the previous loop so now we just call the new
>> function for all bitsizes, simplifying also the 64-bit load_input.
>> ---
>>  src/intel/compiler/brw_fs_nir.cpp | 12 ++--
>>  1 file changed, 2 insertions(+), 10 deletions(-)
>>
>> diff --git a/src/intel/compiler/brw_fs_nir.cpp
>> b/src/intel/compiler/brw_fs_nir.cpp
>> index 6abc7c0174d..fedf3bf5a83 100644
>> --- a/src/intel/compiler/brw_fs_nir.cpp
>> +++ b/src/intel/compiler/brw_fs_nir.cpp
>> @@ -2483,16 +2483,8 @@ fs_visitor::nir_emit_vs_intrinsic(const
>> fs_builder ,
>>        if (type_sz(dest.type) == 8)
>>           first_component /= 2;
>>
>> -      for (unsigned j = 0; j < num_components; j++) {
>> -         bld.MOV(offset(dest, bld, j), offset(src, bld, j +
>> first_component));
>> -      }
>> -
>> -      if (type_sz(dest.type) == 8) {
>> -         shuffle_32bit_load_result_to_64bit_data(bld,
>> -                                                 dest,
>> -                                                 retype(dest,
>> BRW_REGISTER_TYPE_F),
>> -                                               
>>  instr->num_components);
>> -      }
>> +      shuffle_from_32bit_read(bld, dest, retype(src,
>> BRW_REGISTER_TYPE_D),
>> +                              first_component, num_components);
>>
>>
>> I think this is ok.  It makes me a bit nervous to use
>> shuffle_from_32bit_read on the address register file  However, since
>> we're only doing it when type_sz(dst.type) >= 4, it should be ok.
> 
> And as I am going to implement the same-size shuffle with the same MOVs
> we are removing here, it would be as safe as it is now.
> 
>> If we want 16-bit attributes (Yeah, I know, I need to review that...) then we
>> may need to first copy from the ATTR file into a temp.  Maybe drop a
>> comment to that effect?
> 
> All the logic from "i965/fs: Unpack 16-bit from 32-bit components in VS
> load_input" [1] is already implemented here with the new shuffle, so
> 16-bit VS load_input changes would be already implemented with the
> refactoring, (there are 4 other patches needed to solve the issues about
> the vertex buffer format, padding, etc).
> 
> [1] https://patchwork.freedesktop.org/patch/206476/
> 
> I'm including the following comment:
> 
> /* For 16-bit support maybe a temporary is needed to copy from the ATTR
> file */
> 
> I would need to find a test case that can expose this problem...
> 
> Whenever you have the energy to review 16-bit inputs/outputs, let me
> know and I'll send an updated/rebased version. But I also have pending
> the review of "i965/fs: Register allocator shouldn't use grf127 for
> sends dest (v2)" [2] :-)
> 
> [2] https://patchwork.freedesktop.org/patch/217811/
> 
> Thanks for the review,
> 
> Chema
> 
>> Reviewed-by: Jason Ekstrand <ja...@jlekstrand.net>
>>  
>>
>>        break;
>>     }
>>  
>> -- 
>> 2.17.1
>>


Re: [Mesa-dev] [PATCH 13/14] intel/compiler: use new shuffle_32bit_write for all 64-bit storage writes

2018-06-14 Thread Chema Casanova
On 14/06/18 03:44, Jason Ekstrand wrote:
> On Sat, Jun 9, 2018 at 4:13 AM, Jose Maria Casanova Crespo
> <jmcasan...@igalia.com> wrote:
> 
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 2521f3c001b..833fad4247a 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -2839,8 +2839,7 @@ fs_visitor::nir_emit_tcs_intrinsic(const
> fs_builder ,
>                  * for that.
>                  */
>                 unsigned channel = iter * 2 + i;
> -               fs_reg dest = shuffle_64bit_data_for_32bit_write(bld,
> -                  offset(value, bld, channel), 1);
> +               fs_reg dest = shuffle_for_32bit_write(bld, value,
> channel, 1);
> 
> 
> What happened to offsetting "value"?

Using channel as first_component in shuffle_for_32bit_write is
equivalent to offsetting value, and we save one line. :)
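
A minimal sketch of that equivalence, assuming a toy model where a
register is just a flat list of components. Components,
offset_components and select_components are hypothetical stand-ins, not
Mesa functions, and only the component selection is modelled, not the
shuffle itself.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a register: a flat list of per-component values.
using Components = std::vector<int>;

// Stand-in for offset(reg, bld, n): skip the first n components.
static Components
offset_components(const Components &reg, unsigned n)
{
   return Components(reg.begin() + n, reg.end());
}

// Stand-in for the component selection done by the new shuffle helper:
// take `count` components starting at `first_component`.
static Components
select_components(const Components &src, unsigned first_component,
                  unsigned count)
{
   return Components(src.begin() + first_component,
                     src.begin() + first_component + count);
}
```

In this model, select_components(offset_components(value, channel), 0, 1)
and select_components(value, channel, 1) pick the same component for any
valid channel, which is why the explicit offset can be dropped.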

>  
> 
> 
>                 srcs[header_regs + (i + first_component) * 2] = dest;
>                 srcs[header_regs + (i + first_component) * 2 + 1] =
> @@ -3694,8 +3693,8 @@ fs_visitor::nir_emit_cs_intrinsic(const
> fs_builder ,
>        unsigned type_size = 4;
>        if (nir_src_bit_size(instr->src[0]) == 64) {
>           type_size = 8;
> -         val_reg = shuffle_64bit_data_for_32bit_write(bld,
> -            val_reg, instr->num_components);
> +         val_reg = shuffle_for_32bit_write(bld, val_reg, 0,
> +                                           instr->num_components);
>        }
> 
>        unsigned type_slots = type_size / 4;
> @@ -4236,8 +4235,8 @@ fs_visitor::nir_emit_intrinsic(const
> fs_builder , nir_intrinsic_instr *instr
>               * iteration handle the rest.
>               */
>              num_components = MIN2(2, num_components);
> -            write_src = shuffle_64bit_data_for_32bit_write(bld,
> write_src,
> -                                                         
>  num_components);
> +            write_src = shuffle_for_32bit_write(bld, write_src, 0,
> +                                                num_components);
>           } else if (type_size < 4) {
>              assert(type_size == 2);
>              /* For 16-bit types we pack two consecutive values into
> a 32-bit
> @@ -4333,7 +4332,7 @@ fs_visitor::nir_emit_intrinsic(const
> fs_builder , nir_intrinsic_instr *instr
>        unsigned num_components = instr->num_components;
>        unsigned first_component = nir_intrinsic_component(instr);
>        if (nir_src_bit_size(instr->src[0]) == 64) {
> -         src = shuffle_64bit_data_for_32bit_write(bld, src,
> num_components);
> +         src = shuffle_for_32bit_write(bld, src, 0, num_components);
>           num_components *= 2;
>        }
>  
> -- 
> 2.17.1
> 


Re: [Mesa-dev] [PATCH 07/14] intel/compiler: shuffle_from_32bit_read for 64-bit do_untyped_vector_read

2018-06-14 Thread Chema Casanova


On 14/06/18 03:26, Jason Ekstrand wrote:
> On Sat, Jun 9, 2018 at 4:13 AM, Jose Maria Casanova Crespo
> <jmcasan...@igalia.com> wrote:
> 
> do_untyped_vector_read is used at load_ssbo and load_shared.
> 
> The previous MOVs are removed because shuffle_from_32bit_read
> can handle storing the shuffle results in the expected destination
> just using the proper offset.
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 12 ++--
>  1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 7e738ade82e..780a9e228de 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -2434,16 +2434,8 @@ do_untyped_vector_read(const fs_builder ,
>                                                  BRW_PREDICATE_NONE);
> 
>           /* Shuffle the 32-bit load result into valid 64-bit data */
> -         const fs_reg packed_result = bld.vgrf(dest.type,
> iter_components);
> -         shuffle_32bit_load_result_to_64bit_data(
> -            bld, packed_result, read_result, iter_components);
> -
> -         /* Move each component to its destination */
> -         read_result = retype(read_result, BRW_REGISTER_TYPE_DF);
> -         for (int c = 0; c < iter_components; c++) {
> -            bld.MOV(offset(dest, bld, it * 2 + c),
> -                    offset(packed_result, bld, c));
> -         }
> 
> 
> I really don't know why we needed this extra set of MOVs.  They seem
> pretty pointless to me.  Maybe history?  In any case, this looks good.

I've just checked and there is not much history, as the 64-bit code of
this function hasn't changed since it landed. I think the logic was to
shuffle first and then move to the proper destination, instead of
shuffling directly to the final destination.

So maybe Iago remembers if there was any reason why...

> Reviewed-by: Jason Ekstrand
>  
> 
> +         shuffle_from_32bit_read(bld, offset(dest, bld, it * 2),
> +                                 read_result, 0, iter_components);
> 
>           bld.ADD(read_offset, read_offset, brw_imm_ud(16));
>        }
> -- 
> 2.17.1
> 


Re: [Mesa-dev] [PATCH 02/14] intel/compiler: new shuffle_for_32bit_write and shuffle_from_32bit_read

2018-06-14 Thread Chema Casanova
On 14/06/18 03:02, Jason Ekstrand wrote:
> On Sat, Jun 9, 2018 at 4:13 AM, Jose Maria Casanova Crespo
> <jmcasan...@igalia.com> wrote:
> 
> These new shuffle functions deal with the shuffle/unshuffle operations
> needed for read/write operations using 32-bit components when the
> read/written components have a different bit-size (8, 16, 64-bits).
> Shuffle from 32-bit to 32-bit becomes a simple MOV.
> 
> As the new function shuffle_src_to_dst takes of doing a shuffle or an
> unshuffle based on the different type_sz of source an destination this
> generic functions work with any source/destination assuming that writes
> use a 32-bit destination or reads use a 32-bit source.
> 
> 
> I'm having a lot of trouble understanding this paragraph.  Would you
> mind rephrasing it?
>  

Sure, that English didn't compile:

"shuffle_src_to_dst takes care of doing a shuffle when source type is
smaller than destination type and an unshuffle when source type is
bigger than destination. So these new read/write functions just need
to call shuffle_src_to_dst assuming that writes use a 32-bit
destination and reads use a 32-bit source."

I included also this comment in the commit log:

"As shuffle_for_32bit_write/from_32bit_read components take components
in units of source/destination types and shuffle_src_to_dst takes units
of the smallest type component we adjust the components and
first_component parameters."
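
The unit adjustment described in that comment can be sketched on its own
as follows. ShuffleRange and to_smallest_type_units are hypothetical
names for illustration; the sketch assumes, as the patch does, that the
other side of the operation is always a 32-bit type, so only 64-bit
types need rescaling.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pair of values handed on to shuffle_src_to_dst().
struct ShuffleRange {
   uint32_t first_component;
   uint32_t components;
};

// Convert a range given in units of a type_size-byte type (the destination
// type for reads, the source type for writes) into units of the smaller of
// that type and the 32-bit type on the other side.
static ShuffleRange
to_smallest_type_units(uint32_t type_size, uint32_t first_component,
                       uint32_t components)
{
   if (type_size > 4) {
      // Each 64-bit component spans two 32-bit components.
      assert(type_size == 8);
      first_component *= 2;
      components *= 2;
   }
   // For 8, 16 and 32-bit types the given type is already the smallest one.
   return ShuffleRange{first_component, components};
}
```

For example, two 64-bit components starting at component 1 become four
32-bit components starting at component 2, while 16-bit and 32-bit
ranges pass through unchanged.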

> 
> To enable these new functions it is required that there is no
> source/destination overlap in the case of shuffle_from_32bit_read.
> That never happens on shuffle_for_32bit_write as it allocates a new
> destination register, as shuffle_64bit_data_for_32bit_write did.
> ---
>  src/intel/compiler/brw_fs.h       | 11 +
>  src/intel/compiler/brw_fs_nir.cpp | 38 +++
>  2 files changed, 49 insertions(+)
> 
> diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h
> index faf51568637..779170ecc95 100644
> --- a/src/intel/compiler/brw_fs.h
> +++ b/src/intel/compiler/brw_fs.h
> @@ -519,6 +519,17 @@ void shuffle_16bit_data_for_32bit_write(const
> brw::fs_builder ,
>                                          const fs_reg ,
>                                          uint32_t components);
> 
> +void shuffle_from_32bit_read(const brw::fs_builder &bld,
> +                             const fs_reg &dst,
> +                             const fs_reg &src,
> +                             uint32_t first_component,
> +                             uint32_t components);
> +
> +fs_reg shuffle_for_32bit_write(const brw::fs_builder &bld,
> +                               const fs_reg &src,
> +                               uint32_t first_component,
> +                               uint32_t components);
> +
>  fs_reg setup_imm_df(const brw::fs_builder ,
>                      double v);
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 1a9d3c41d1d..1f684149fd5 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -5454,6 +5454,44 @@ shuffle_src_to_dst(const fs_builder ,
>     }
>  }
> 
> +void
> +shuffle_from_32bit_read(const fs_builder &bld,
> +                        const fs_reg &dst,
> +                        const fs_reg &src,
> +                        uint32_t first_component,
> +                        uint32_t components)
> +{
> +   assert(type_sz(src.type) == 4);
> +
> 
> 
> /* This function takes components in units of the destination type while
> shuffle_src_to_dst takes components in units of the smallest type */

Done.

> +   if (type_sz(dst.type) > 4) {
> +      assert(type_sz(dst.type) == 8);
> +      first_component *= 2;
> +      components *= 2;
> +   }
> +
> +   shuffle_src_to_dst(bld, dst, src, first_component, components);
> +}
> +
> +fs_reg
> +shuffle_for_32bit_write(const fs_builder &bld,
> +                        const fs_reg &src,
> +                        uint32_t first_component,
> +                        uint32_t components)
> +{
> +   fs_reg dst = bld.vgrf(BRW_REGISTER_TYPE_D,
> +                         DIV_ROUND_UP (components *
> type_sz(src.type), 4));
> +
> 
> 
> /* This function takes components in units of the source type while
> shuffle_src_to_dst takes components in units of the smallest type */

Done.

> With those added and the commit message re-worded a bit,
> 
> Reviewed-by: Jason Ekstrand

Thanks for the review.

Chema

> +   if (type_sz(src.type) > 4) {
> +      assert(type_sz(src.type) == 8);
> +      first_component *= 2;
> +      components *= 2;
> +   }
> +
> +   shuffle_src_to_dst(bld, dst, src, first_component, components);
> +
> 

Re: [Mesa-dev] [PATCH 01/14] intel/compiler: general 8/16/32/64-bit shuffle_src_to_dst function

2018-06-14 Thread Chema Casanova
On 14/06/18 at 02:46, Jason Ekstrand wrote:
> On Wed, Jun 13, 2018 at 5:07 PM, Chema Casanova <jmcasan...@igalia.com> wrote:
> 
> On 13/06/18 22:46, Jason Ekstrand wrote:
> > On Sat, Jun 9, 2018 at 4:13 AM, Jose Maria Casanova Crespo
> > <jmcasan...@igalia.com> wrote:
> >
> >     This new function takes care of shuffle/unshuffle components of a
> >     particular bit-size in components with a different bit-size.
> >
> >     If source type size is smaller than destination type size the
> operation
> >     needed is a component shuffle. The opposite case would be an
> unshuffle.
> >
> >     The operation allows to skip first_component number of
> components from
> >     the source.
> >
> >     Shuffle MOVs are retyped using integer types avoiding problems
> with
> >     denorms
> >     and float types. This allows to simplify uses of shuffle functions
> >     that are
> >     dealing with these retypes individually.
> >
> >     Now there is a new restriction so source and destination can
> not overlap
> >     anymore when calling this shuffle function. Following patches that
> >     migrate
> >     to use this new function will take care individually of
> avoiding source
> >     and destination overlaps.
> >     ---
> >      src/intel/compiler/brw_fs_nir.cpp | 92
> +++
> >      1 file changed, 92 insertions(+)
> >
> >     diff --git a/src/intel/compiler/brw_fs_nir.cpp
> >     b/src/intel/compiler/brw_fs_nir.cpp
> >     index 166da0aa6d7..1a9d3c41d1d 100644
> >     --- a/src/intel/compiler/brw_fs_nir.cpp
> >     +++ b/src/intel/compiler/brw_fs_nir.cpp
> >     @@ -5362,6 +5362,98 @@ shuffle_16bit_data_for_32bit_write(const
> >     fs_builder ,
> >         }
> >      }
> >
> >     +/*
> >     + * This helper takes a source register and un/shuffles it
> into the
> >     destination
> >     + * register.
> >     + *
> >     + * If source type size is smaller than destination type size the
> >     operation
> >     + * needed is a component shuffle. The opposite case would be an
> >     unshuffle. If
> >     + * source/destination type size is equal a shuffle is done that
> >     would be
> >     + * equivalent to a simple MOV.
> >
> >
> > There's a sticky bit here if we want this to work with 64-bit types on
> > gen7 and earlier because we only have DF there and not Q so the
> > brw_reg_type_from_bit_size below doesn't work.  If we care about that
> > case (and I'm not convinced we do), it should be easy enough to add a
> > type_sz(src.type) == type_sz(dst.type) case which just does MOVs from
> > source to dest.
> 
> At this moment, current uses of this function are to read from 32-bits
> or to write to 32-bit. But I think that for completeness it would be
> nice to have all cases covered. The option of doing the MOVs in the case
> of equality (that would be quite normal) saves us from doing the shuffle
> calculation for the simple case. So I'm going for it.
> 
> >     + *
> >     + * For example, if source is a 16-bit type and destination is
> >     32-bit. A 3
> >     + * components .xyz 16-bit vector on SIMD8 would be.
> >     + *
> >     + *    |x1|x2|x3|x4|x5|x6|x7|x8|y1|y2|y3|y4|y5|y6|y7|y8|
> >     + *    |z1|z2|z3|z4|z5|z6|z7|z8|  |  |  |  |  |  |  |  |
> >     + *
> >     + * This helper will return the following 2 32-bit components with
> >     the 16-bit
> >     + * values shuffled:
> >     + *
> >     + *    |x1 y1|x2 y2|x3 y3|x4 y4|x5 y5|x6 y6|x7 y7|x8 y8|
> >     + *    |z1   |z2   |z3   |z4   |z5   |z6   |z7   |z8   |
> >     + *
> >     + * For unshuffle, the example would be the opposite, a 64-bit
> type
> >     source
> >     + * and a 32-bit destination. A 2 component .xy 64-bit vector
> on SIMD8
> >     + * would be:
> >     + *
> >     + *    | x1l   x1h | x2l   x2h | x3l   x3h | x4l   x4h |
> >     + *    | x5l   x5h | x6l   x6h | x7l   x7h | x8l   x8h |
> >     + *    | y1l   y1h | y2l   y2h | y3l   y3h | y4l   y4h 

Re: [Mesa-dev] [PATCH 01/14] intel/compiler: general 8/16/32/64-bit shuffle_src_to_dst function

2018-06-13 Thread Chema Casanova
On 13/06/18 22:46, Jason Ekstrand wrote:
> On Sat, Jun 9, 2018 at 4:13 AM, Jose Maria Casanova Crespo
> mailto:jmcasan...@igalia.com>> wrote:
> 
> This new function takes care of shuffle/unshuffle components of a
> particular bit-size in components with a different bit-size.
> 
> If source type size is smaller than destination type size the operation
> needed is a component shuffle. The opposite case would be an unshuffle.
> 
> The operation allows to skip first_component number of components from
> the source.
> 
> Shuffle MOVs are retyped using integer types avoiding problems with
> denorms
> and float types. This allows to simplify uses of shuffle functions
> that are
> dealing with these retypes individually.
> 
> Now there is a new restriction so source and destination can not overlap
> anymore when calling this shuffle function. Following patches that
> migrate
> to use this new function will take care individually of avoiding source
> and destination overlaps.
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 92 +++
>  1 file changed, 92 insertions(+)
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 166da0aa6d7..1a9d3c41d1d 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -5362,6 +5362,98 @@ shuffle_16bit_data_for_32bit_write(const
> fs_builder ,
>     }
>  }
> 
> +/*
> + * This helper takes a source register and un/shuffles it into the
> destination
> + * register.
> + *
> + * If source type size is smaller than destination type size the
> operation
> + * needed is a component shuffle. The opposite case would be an
> unshuffle. If
> + * source/destination type size is equal a shuffle is done that
> would be
> + * equivalent to a simple MOV.
> 
> 
> There's a sticky bit here if we want this to work with 64-bit types on
> gen7 and earlier because we only have DF there and not Q so the
> brw_reg_type_from_bit_size below doesn't work.  If we care about that
> case (and I'm not convinced we do), it should be easy enough to add a
> type_sz(src.type) == type_sz(dst.type) case which just does MOVs from
> source to dest.

At the moment, the current uses of this function either read from 32-bit
or write to 32-bit. But I think that for completeness it would be
nice to have all cases covered. Doing plain MOVs when the sizes are
equal (which would be quite common) saves us from doing the shuffle
calculation for the simple case. So I'm going for it.
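
As a rough illustration of the layout that the helper's comment describes,
here is a small standalone sketch that models registers as plain arrays.
It is not the Mesa implementation: SIMD_WIDTH, shuffle_16_to_32 and
copy_same_size are hypothetical names used only for this example, and the
equal-size case is modelled as a straight copy, which is what a series of
MOVs would amount to.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIMD_WIDTH 8   /* assumption: SIMD8, as in the comment's example */

/* Interleave 16-bit source components into 32-bit sized destination slots.
 * Source layout: component c, lane l lives at src[c * SIMD_WIDTH + l].
 * Destination layout: each 32-bit slot holds two consecutive components of
 * the same lane, so dst is addressed here in 16-bit halves. */
static void
shuffle_16_to_32(uint16_t *dst, const uint16_t *src, unsigned components)
{
   for (unsigned c = 0; c < components; c++) {
      unsigned dst_comp = c / 2;   /* which 32-bit destination component */
      unsigned half = c % 2;       /* first or second 16-bit half of it */
      for (unsigned lane = 0; lane < SIMD_WIDTH; lane++)
         dst[(dst_comp * SIMD_WIDTH + lane) * 2 + half] =
            src[c * SIMD_WIDTH + lane];
   }
}

/* Equal type sizes: nothing to interleave, so a plain copy is enough. */
static void
copy_same_size(uint16_t *dst, const uint16_t *src, unsigned components)
{
   assert(dst != src);
   memcpy(dst, src, components * SIMD_WIDTH * sizeof(*src));
}
```

With a 3-component .xyz source, the first destination component ends up
holding x and y interleaved per lane, and the second holds z with the other
half of each slot left untouched.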

> + *
> + * For example, if source is a 16-bit type and destination is
> 32-bit. A 3
> + * components .xyz 16-bit vector on SIMD8 would be.
> + *
> + *    |x1|x2|x3|x4|x5|x6|x7|x8|y1|y2|y3|y4|y5|y6|y7|y8|
> + *    |z1|z2|z3|z4|z5|z6|z7|z8|  |  |  |  |  |  |  |  |
> + *
> + * This helper will return the following 2 32-bit components with
> the 16-bit
> + * values shuffled:
> + *
> + *    |x1 y1|x2 y2|x3 y3|x4 y4|x5 y5|x6 y6|x7 y7|x8 y8|
> + *    |z1   |z2   |z3   |z4   |z5   |z6   |z7   |z8   |
> + *
> + * For unshuffle, the example would be the opposite, a 64-bit type
> source
> + * and a 32-bit destination. A 2 component .xy 64-bit vector on SIMD8
> + * would be:
> + *
> + *    | x1l   x1h | x2l   x2h | x3l   x3h | x4l   x4h |
> + *    | x5l   x5h | x6l   x6h | x7l   x7h | x8l   x8h |
> + *    | y1l   y1h | y2l   y2h | y3l   y3h | y4l   y4h |
> + *    | y5l   y5h | y6l   y6h | y7l   y7h | y8l   y8h |
> + *
> + * The returned result would be the following 4 32-bit components
> unshuffled:
> + *
> + *    | x1l | x2l | x3l | x4l | x5l | x6l | x7l | x8l |
> + *    | x1h | x2h | x3h | x4h | x5h | x6h | x7h | x8h |
> + *    | y1l | y2l | y3l | y4l | y5l | y6l | y7l | y8l |
> + *    | y1h | y2h | y3h | y4h | y5h | y6h | y7h | y8h |
> + *
> + * - Source and destination register must not be overlapped.
> + * - first_component parameter allows skipping source components.
> + */
> +void
> +shuffle_src_to_dst(const fs_builder ,
> +                   const fs_reg ,
> +                   const fs_reg ,
> +                   uint32_t first_component,
> +                   uint32_t components)
> +{
> +   if (type_sz(src.type) <= type_sz(dst.type)) {
> +      /* Source is shuffled into destination */
> +      unsigned size_ratio = type_sz(dst.type) / type_sz(src.type);
> +#ifndef NDEBUG
> +      boolean src_dst_overlap = regions_overlap(dst,
> +         type_sz(dst.type) * bld.dispatch_width() * components,
> +         offset(src, bld, first_component * size_ratio),
> 
> 
> Why do you need to multiply first_component by size_ratio?  It's already
> in units of source components.

Yes, that's wrong. I forgot to 

Re: [Mesa-dev] [PATCH v2] nir/print: fix printing of 8/16 bit constant variables

2018-05-29 Thread Chema Casanova


El 29/05/18 a las 02:14, Karol Herbst escribió:
> v2 (Chema Casanova ): add float16 support
> 
> Signed-off-by: Karol Herbst 
> ---
>  src/compiler/nir/nir_print.c | 31 +++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/src/compiler/nir/nir_print.c b/src/compiler/nir/nir_print.c
> index 97b2d6164cd..bb2d4e52067 100644
> --- a/src/compiler/nir/nir_print.c
> +++ b/src/compiler/nir/nir_print.c
> @@ -299,6 +299,28 @@ print_constant(nir_constant *c, const struct glsl_type 
> *type, print_state *state
> unsigned i, j;
>  
> switch (glsl_get_base_type(type)) {
> +   case GLSL_TYPE_UINT8:
> +   case GLSL_TYPE_INT8:
> +  /* Only float base types can be matrices. */
> +  assert(cols == 1);
> +
> +  for (i = 0; i < rows; i++) {
> + if (i > 0) fprintf(fp, ", ");
> + fprintf(fp, "0x%02x", c->values[0].u8[i]);
> +  }
> +  break;
> +
> +   case GLSL_TYPE_UINT16:
> +   case GLSL_TYPE_INT16:
> +  /* Only float base types can be matrices. */
> +  assert(cols == 1);
> +
> +  for (i = 0; i < rows; i++) {
> + if (i > 0) fprintf(fp, ", ");
> + fprintf(fp, "0x%04x", c->values[0].u16[i]);
> +  }
> +  break;
> +
> case GLSL_TYPE_UINT:
> case GLSL_TYPE_INT:
> case GLSL_TYPE_BOOL:
> @@ -311,6 +333,15 @@ print_constant(nir_constant *c, const struct glsl_type 
> *type, print_state *state
>}
>break;
>  
> +   case GLSL_TYPE_FLOAT16:
> +  for (i = 0; i < cols; i++) {
> + for (j = 0; j < rows; j++) {
> +if (i + j > 0) fprintf(fp, ", ");
> +fprintf(fp, "%f", _mesa_half_to_float(c->values[i].u16[i]));

It should be:

 fprintf(fp, "%f", _mesa_half_to_float(c->values[i].u16[j]));

With that fixed.

Reviewed-by: Jose Maria Casanova Crespo 
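
To make the indexing issue concrete, here is a minimal standalone sketch of
the column/row loop. It does not use the NIR types: struct half_column,
half_to_float and format_half_matrix are hypothetical stand-ins for
nir_constant, _mesa_half_to_float and print_constant, and the output goes
to a buffer instead of a FILE so it is easy to check.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one column of a constant: up to 4 half floats. */
struct half_column {
   uint16_t u16[4];
};

/* Illustrative IEEE 754 binary16 to float conversion. */
static float
half_to_float(uint16_t h)
{
   unsigned exp = (h >> 10) & 0x1f;
   unsigned mant = h & 0x3ff;
   float v;

   if (exp == 0) {
      v = (float)mant / 16777216.0f;          /* zero or subnormal: mant * 2^-24 */
   } else if (exp == 31) {
      v = mant ? NAN : INFINITY;
   } else {
      int e = (int)exp - 15;
      v = (float)(mant | 0x400) / 1024.0f;    /* 1.mant */
      for (; e > 0; e--)
         v *= 2.0f;
      for (; e < 0; e++)
         v *= 0.5f;
   }
   return (h & 0x8000) ? -v : v;
}

/* Column index i selects the column, row index j selects the element
 * inside it. Using i for both would only ever read the diagonal. */
static void
format_half_matrix(char *buf, size_t len, const struct half_column *values,
                   unsigned cols, unsigned rows)
{
   size_t n = 0;

   assert(len > 0);
   buf[0] = '\0';
   for (unsigned i = 0; i < cols; i++) {
      for (unsigned j = 0; j < rows; j++) {
         int w;
         if (n >= len)
            return;
         w = snprintf(buf + n, len - n, "%s%f", (i + j > 0) ? ", " : "",
                      half_to_float(values[i].u16[j]));
         if (w < 0)
            return;
         n += (size_t)w;
      }
   }
}
```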


> + }
> +  }
> +  break;
> +
> case GLSL_TYPE_FLOAT:
>for (i = 0; i < cols; i++) {
>   for (j = 0; j < rows; j++) {
> 


Re: [Mesa-dev] [PATCH] nir/print: fix printing of 8/16 bit constant variables

2018-05-21 Thread Chema Casanova
As GLSL_TYPE_FLOAT16 support is not implemented in this patch, we
would need to change the commit summary to ".. 8/16 bit integer constant.."
or just implement half-float support with something like:

+   case GLSL_TYPE_FLOAT16:
+  for (i = 0; i < cols; i++) {
+ for (j = 0; j < rows; j++) {
+if (i + j > 0) fprintf(fp, ", ");
+fprintf(fp, "%f",_mesa_half_to_float(c->values[i].u16[j]));
+ }
+  }
+  break;


Chema

El 21/05/18 a las 14:51, Karol Herbst escribió:
> Signed-off-by: Karol Herbst 
> ---
>  src/compiler/nir/nir_print.c | 22 ++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/src/compiler/nir/nir_print.c b/src/compiler/nir/nir_print.c
> index 97b2d6164cd..e331a26d932 100644
> --- a/src/compiler/nir/nir_print.c
> +++ b/src/compiler/nir/nir_print.c
> @@ -299,6 +299,28 @@ print_constant(nir_constant *c, const struct glsl_type 
> *type, print_state *state
> unsigned i, j;
>  
> switch (glsl_get_base_type(type)) {
> +   case GLSL_TYPE_UINT8:
> +   case GLSL_TYPE_INT8:
> +  /* Only float base types can be matrices. */
> +  assert(cols == 1);
> +
> +  for (i = 0; i < rows; i++) {
> + if (i > 0) fprintf(fp, ", ");
> + fprintf(fp, "0x%02x", c->values[0].u8[i]);
> +  }
> +  break;
> +
> +   case GLSL_TYPE_UINT16:
> +   case GLSL_TYPE_INT16:
> +  /* Only float base types can be matrices. */
> +  assert(cols == 1);
> +
> +  for (i = 0; i < rows; i++) {
> + if (i > 0) fprintf(fp, ", ");
> + fprintf(fp, "0x%04x", c->values[0].u16[i]);
> +  }
> +  break;
> +
> case GLSL_TYPE_UINT:
> case GLSL_TYPE_INT:
> case GLSL_TYPE_BOOL:
> 


Re: [Mesa-dev] [PATCH] intel/eu: Set EXECUTE_1 when setting the rounding mode in cr0

2018-05-21 Thread Chema Casanova
Thanks for fixing the full overwrite of the Control Register.

Reviewed-by: Jose Maria Casanova Crespo 

El 19/05/18 a las 05:09, Jason Ekstrand escribió:
> Fixes: d6cd14f2131a5b "i965/fs: Define new shader opcode to..."
> ---
>  src/intel/compiler/brw_eu_emit.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/src/intel/compiler/brw_eu_emit.c 
> b/src/intel/compiler/brw_eu_emit.c
> index 6c9dced..4f51d51 100644
> --- a/src/intel/compiler/brw_eu_emit.c
> +++ b/src/intel/compiler/brw_eu_emit.c
> @@ -3716,6 +3716,7 @@ brw_rounding_mode(struct brw_codegen *p,
> if (bits != BRW_CR0_RND_MODE_MASK) {
>brw_inst *inst = brw_AND(p, brw_cr0_reg(0), brw_cr0_reg(0),
> brw_imm_ud(~BRW_CR0_RND_MODE_MASK));
> +  brw_inst_set_exec_size(p->devinfo, inst, BRW_EXECUTE_1);
>  
>/* From the Skylake PRM, Volume 7, page 760:
> *  "Implementation Restriction on Register Access: When the control
> @@ -3730,6 +3731,7 @@ brw_rounding_mode(struct brw_codegen *p,
> if (bits) {
>brw_inst *inst = brw_OR(p, brw_cr0_reg(0), brw_cr0_reg(0),
>brw_imm_ud(bits));
> +  brw_inst_set_exec_size(p->devinfo, inst, BRW_EXECUTE_1);
>brw_inst_set_thread_control(p->devinfo, inst, BRW_THREAD_SWITCH);
> }
>  }
> 


Re: [Mesa-dev] [PATCH 1/3] intel/compiler: make brw_reg_type_from_bit_size usable from other places

2018-05-16 Thread Chema Casanova


El 15/05/18 a las 13:05, Iago Toral Quiroga escribió:
> This was private to brw_fs_nir.cpp but we are going to need it soon in
> brw_fs.cpp, so move it there and make it available to other files as we
> do for other utility functions.
> ---
>  src/intel/compiler/brw_fs.cpp | 59 
> +++
>  src/intel/compiler/brw_fs.h   |  4 +++
>  src/intel/compiler/brw_fs_nir.cpp | 59 
> ---
>  3 files changed, 63 insertions(+), 59 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
> index dcba4ee8068..458c534c9c7 100644
> --- a/src/intel/compiler/brw_fs.cpp
> +++ b/src/intel/compiler/brw_fs.cpp
> @@ -900,6 +900,65 @@ fs_inst::size_read(int arg) const
> return 0;
>  }
>  
> +/*
> + * Returns a type based on a reference_type (word, float, half-float) and a
> + * given bit_size.
> + *
> + * Reference BRW_REGISTER_TYPE are HF,F,DF,W,D,UW,UD.
> + *
> + * @FIXME: 64-bit return types are always DF on integer types to maintain
> + * compability with uses of DF previously to the introduction of int64
> + * support.
> + */

This FIXME comment doesn't apply to the current code, so it can be removed.

With that:

Reviewed-by: Jose Maria Casanova Crespo 


> +brw_reg_type
> +brw_reg_type_from_bit_size(const unsigned bit_size,
> +   const brw_reg_type reference_type)
> +{
> +   switch(reference_type) {
> +   case BRW_REGISTER_TYPE_HF:
> +   case BRW_REGISTER_TYPE_F:
> +   case BRW_REGISTER_TYPE_DF:
> +  switch(bit_size) {
> +  case 16:
> + return BRW_REGISTER_TYPE_HF;
> +  case 32:
> + return BRW_REGISTER_TYPE_F;
> +  case 64:
> + return BRW_REGISTER_TYPE_DF;
> +  default:
> + unreachable("Invalid bit size");
> +  }
> +   case BRW_REGISTER_TYPE_W:
> +   case BRW_REGISTER_TYPE_D:
> +   case BRW_REGISTER_TYPE_Q:
> +  switch(bit_size) {
> +  case 16:
> + return BRW_REGISTER_TYPE_W;
> +  case 32:
> + return BRW_REGISTER_TYPE_D;
> +  case 64:
> + return BRW_REGISTER_TYPE_Q;
> +  default:
> + unreachable("Invalid bit size");
> +  }
> +   case BRW_REGISTER_TYPE_UW:
> +   case BRW_REGISTER_TYPE_UD:
> +   case BRW_REGISTER_TYPE_UQ:
> +  switch(bit_size) {
> +  case 16:
> + return BRW_REGISTER_TYPE_UW;
> +  case 32:
> + return BRW_REGISTER_TYPE_UD;
> +  case 64:
> + return BRW_REGISTER_TYPE_UQ;
> +  default:
> + unreachable("Invalid bit size");
> +  }
> +   default:
> +  unreachable("Unknown type");
> +   }
> +}
> +
>  namespace {
> /* Return the subset of flag registers that an instruction could
>  * potentially read or write based on the execution controls and flag
> diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h
> index e384db809dc..c4d5ebee239 100644
> --- a/src/intel/compiler/brw_fs.h
> +++ b/src/intel/compiler/brw_fs.h
> @@ -525,4 +525,8 @@ fs_reg setup_imm_df(const brw::fs_builder ,
>  enum brw_barycentric_mode brw_barycentric_mode(enum glsl_interp_mode mode,
> nir_intrinsic_op op);
>  
> +brw_reg_type
> +brw_reg_type_from_bit_size(const unsigned bit_size,
> +   const brw_reg_type reference_type);
> +
>  #endif /* BRW_FS_H */
> diff --git a/src/intel/compiler/brw_fs_nir.cpp 
> b/src/intel/compiler/brw_fs_nir.cpp
> index 58ddc456bae..490fd4a0461 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -260,65 +260,6 @@ fs_visitor::nir_emit_system_values()
> }
>  }
>  
> -/*
> - * Returns a type based on a reference_type (word, float, half-float) and a
> - * given bit_size.
> - *
> - * Reference BRW_REGISTER_TYPE are HF,F,DF,W,D,UW,UD.
> - *
> - * @FIXME: 64-bit return types are always DF on integer types to maintain
> - * compability with uses of DF previously to the introduction of int64
> - * support.
> - */
> -static brw_reg_type
> -brw_reg_type_from_bit_size(const unsigned bit_size,
> -   const brw_reg_type reference_type)
> -{
> -   switch(reference_type) {
> -   case BRW_REGISTER_TYPE_HF:
> -   case BRW_REGISTER_TYPE_F:
> -   case BRW_REGISTER_TYPE_DF:
> -  switch(bit_size) {
> -  case 16:
> - return BRW_REGISTER_TYPE_HF;
> -  case 32:
> - return BRW_REGISTER_TYPE_F;
> -  case 64:
> - return BRW_REGISTER_TYPE_DF;
> -  default:
> - unreachable("Invalid bit size");
> -  }
> -   case BRW_REGISTER_TYPE_W:
> -   case BRW_REGISTER_TYPE_D:
> -   case BRW_REGISTER_TYPE_Q:
> -  switch(bit_size) {
> -  case 16:
> - return BRW_REGISTER_TYPE_W;
> -  case 32:
> - return BRW_REGISTER_TYPE_D;
> -  case 64:
> - return BRW_REGISTER_TYPE_Q;
> -  default:
> - unreachable("Invalid bit size");
> -  }
> -   case BRW_REGISTER_TYPE_UW:
> -   

Re: [Mesa-dev] [PATCH 2/2] i965/fs: Register allocator shoudn't use grf127 for sends dest (v2)

2018-05-04 Thread Chema Casanova
This patch is still pending review.

Also adding Cc: to stable, as it fixes some CTS issues.

Chema


El 19/04/18 a las 02:38, Jose Maria Casanova Crespo escribió:
> Since Gen8+ Intel PRM states that "r127 must not be used for return
> address when there is a src and dest overlap in send instruction."
> 
> This patch implements this restriction creating new grf127_send_hack_node
> at the register allocator. This node has a fixed assignation to grf127.
> 
> For vgrf that are used as destination of send messages we create node
> interfereces with the grf127_send_hack_node. So the register allocator
> will never assign to these vgrf a register that involves grf127.
> 
> If dispatch_width > 8 we don't create these interferences to the because
> all instructions have node interferences between sources and destination.
> That is enough to avoid the r127 restriction.
> 
> This fixes CTS tests that raised this issue as they were executed as SIMD8:
>   
> dEQP-VK.spirv_assembly.instruction.graphics.16bit_storage.uniform_16struct_to_32struct.uniform_buffer_block_vert
>   
> dEQP-VK.spirv_assembly.instruction.graphics.16bit_storage.uniform_16struct_to_32struct.uniform_buffer_block_tessc
> 
> Shader-db results on Skylake:
>total instructions in shared programs: 7686798 -> 7686797 (<.01%)
>instructions in affected programs: 301 -> 300 (-0.33%)
>helped: 1
>HURT: 0
> 
>total cycles in shared programs: 337092322 -> 337091919 (<.01%)
>cycles in affected programs: 22420415 -> 22420012 (<.01%)
>helped: 712
>HURT: 588
> 
> Shader-db results on Broadwell:
> 
>total instructions in shared programs: 7658574 -> 7658625 (<.01%)
>instructions in affected programs: 19610 -> 19661 (0.26%)
>helped: 3
>HURT: 4
> 
>total cycles in shared programs: 340694553 -> 340676378 (<.01%)
>cycles in affected programs: 24724915 -> 24706740 (-0.07%)
>helped: 998
>HURT: 916
> 
>total spills in shared programs: 4300 -> 4311 (0.26%)
>spills in affected programs: 333 -> 344 (3.30%)
>helped: 1
>HURT: 3
> 
>total fills in shared programs: 5370 -> 5378 (0.15%)
>fills in affected programs: 274 -> 282 (2.92%)
>helped: 1
>HURT: 3
> 
> v2: Avoid duplicating register classes without grf127. Let's use a node
> with a fixed assignation to grf127 and create interferences to send
> message vgrf destinations. (Eric Anholt)
> ---
>  src/intel/compiler/brw_fs_reg_allocate.cpp | 25 ++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/src/intel/compiler/brw_fs_reg_allocate.cpp 
> b/src/intel/compiler/brw_fs_reg_allocate.cpp
> index ec8e116cb38..59e047483c0 100644
> --- a/src/intel/compiler/brw_fs_reg_allocate.cpp
> +++ b/src/intel/compiler/brw_fs_reg_allocate.cpp
> @@ -548,6 +548,9 @@ fs_visitor::assign_regs(bool allow_spilling, bool 
> spill_all)
> int first_mrf_hack_node = node_count;
> if (devinfo->gen >= 7)
>node_count += BRW_MAX_GRF - GEN7_MRF_HACK_START;
> +   int grf127_send_hack_node = node_count;
> +   if (devinfo->gen >= 8 && dispatch_width == 8)
> +  node_count ++;
> struct ra_graph *g =
>ra_alloc_interference_graph(compiler->fs_reg_sets[rsi].regs, 
> node_count);
>  
> @@ -653,6 +656,28 @@ fs_visitor::assign_regs(bool allow_spilling, bool 
> spill_all)
>}
> }
>  
> +   if (devinfo->gen >= 8 && dispatch_width == 8) {
> +  /* At Intel Broadwell PRM, vol 07, section "Instruction Set Reference",
> +   * subsection "EUISA Instructions", Send Message (page 990):
> +   *
> +   * "r127 must not be used for return address when there is a src and
> +   * dest overlap in send instruction."
> +   *
> +   * We are avoiding using grf127 as part of the destination of send
> +   * messages adding a node interference to the grf127_send_hack_node.
> +   * This node has a fixed asignment to grf127.
> +   *
> +   * We don't apply it to SIMD16 because previous code avoids any 
> register
> +   * overlap between sources and destination.
> +   */
> +  ra_set_node_reg(g, grf127_send_hack_node, 127);
> +  foreach_block_and_inst(block, fs_inst, inst, cfg) {
> + if (inst->is_send_from_grf() && inst->dst.file == VGRF) {
> +ra_add_node_interference(g, inst->dst.nr, grf127_send_hack_node);
> + }
> +  }
> +   }
> +
> /* Debug of register spilling: Go spill everything. */
> if (unlikely(spill_all)) {
>int reg = choose_spill_reg(g);
> 


Re: [Mesa-dev] [PATCH v2 07/18] intel/compiler: fix brw_negate_immediate for 16-bit types

2018-05-02 Thread Chema Casanova


El 30/04/18 a las 23:14, Jason Ekstrand escribió:
> 
> 
> On Mon, Apr 30, 2018 at 7:18 AM, Iago Toral Quiroga  > wrote:
> 
> From: Jose Maria Casanova Crespo  >
> 
> From Intel Skylake PRM, vol 07, "Immediate" section (page 768):
> 
> "For a word, unsigned word, or half-float immediate data,
> software must replicate the same 16-bit immediate value to both
> the lower word and the high word of the 32-bit immediate field
> in a GEN instruction."
> 
> This patch implements float16 negate and fix the int16/uint16
> negate that wasn't taking into account the replication in lower
> and higher words.
> 
> 
> Since this fixes a bug, do we want to split it in two and send the
> bug-fix to stable?

Makes sense to split. I'm also going to send the brw_imm_w patch to stable.

I found the same issue in brw_abs_immediate, so I'm including a fix for it
in v3 of this patch.


> 
> v2: Integer cases are different to Float cases. (Jason Ekstrand)
>     Included reference to PRM (Jose Maria Casanova)
> ---
>  src/intel/compiler/brw_shader.cpp | 11 +++
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_shader.cpp
> b/src/intel/compiler/brw_shader.cpp
> index 9cdf9fcb23..76dd1173fa 100644
> --- a/src/intel/compiler/brw_shader.cpp
> +++ b/src/intel/compiler/brw_shader.cpp
> @@ -580,8 +580,13 @@ brw_negate_immediate(enum brw_reg_type type,
> struct brw_reg *reg)
>        reg->d = -reg->d;
>        return true;
>     case BRW_REGISTER_TYPE_W:
> -   case BRW_REGISTER_TYPE_UW:
> -      reg->d = -(int16_t)reg->ud;
> +   case BRW_REGISTER_TYPE_UW: {
> +      uint16_t value = -(int16_t)reg->ud;
> +      reg->ud = value | value << 16;
> 
> 
> You're shifting an explicitly 16-bit value by 16.  I think you want to
> cast to uint32_t.

As agreed, I'll change this to:

reg->ud = value | (uint32_t) value << 16;
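
For reference, a small standalone sketch of why the integer and half-float
cases have to stay separate. The function names are hypothetical and this
is not the brw_negate_immediate code: it works on a bare 32-bit immediate
field, and it negates in unsigned arithmetic (equivalent to the signed
negate on two's complement) so that the example has no
implementation-defined conversions.

```c
#include <stdint.h>

/* W/UW immediates: negate the low 16-bit value, then replicate the result
 * into both words of the 32-bit immediate field. */
static uint32_t
negate_imm_w(uint32_t ud)
{
   uint16_t value = (uint16_t)(0u - (ud & 0xffffu));
   return (uint32_t)value | ((uint32_t)value << 16);
}

/* HF immediates: the value is sign-magnitude, so flipping the sign bit of
 * each replicated half is enough. */
static uint32_t
negate_imm_hf(uint32_t ud)
{
   return ud ^ 0x80008000u;
}
```

Applying the sign-bit flip to an integer immediate such as 3 (0x00030003)
gives 0x80038003, which is not -3 (0xfffdfffd). That is the two's
complement problem pointed out in the review.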


> +      return true;
> +   }
> +   case BRW_REGISTER_TYPE_HF:
> +      reg->ud ^= 0x80008000;
>        return true;
>     case BRW_REGISTER_TYPE_F:
>        reg->f = -reg->f;
> @@ -602,8 +607,6 @@ brw_negate_immediate(enum brw_reg_type type,
> struct brw_reg *reg)
>     case BRW_REGISTER_TYPE_UV:
>     case BRW_REGISTER_TYPE_V:
>        assert(!"unimplemented: negate UV/V immediate");
> -   case BRW_REGISTER_TYPE_HF:
> -      assert(!"unimplemented: negate HF immediate");
>     case BRW_REGISTER_TYPE_NF:
>        unreachable("no NF immediates");
>     }
> -- 
> 2.14.1
> 
> 


Re: [Mesa-dev] [PATCH v2 06/18] intel/compiler: fix brw_imm_w for negative 16-bit integers

2018-05-02 Thread Chema Casanova


El 01/05/18 a las 01:22, Jason Ekstrand escribió:
> On Mon, Apr 30, 2018 at 3:53 PM, Chema Casanova <jmcasan...@igalia.com
> <mailto:jmcasan...@igalia.com>> wrote:
> 
> 
> 
> On 30/04/18 23:12, Jason Ekstrand wrote:
> > On Mon, Apr 30, 2018 at 7:18 AM, Iago Toral Quiroga <ito...@igalia.com 
> <mailto:ito...@igalia.com>
> > <mailto:ito...@igalia.com <mailto:ito...@igalia.com>>> wrote:
> > 
> >     From: Jose Maria Casanova Crespo <jmcasan...@igalia.com 
> <mailto:jmcasan...@igalia.com>
> >     <mailto:jmcasan...@igalia.com <mailto:jmcasan...@igalia.com>>>
> >
> >     16-bit immediates need to replicate the 16-bit immediate value
> >     in both words of the 32-bit value. This needs to be careful
> >     to avoid sign-extension, which the previous implementation was
> >     not handling properly.
> >
> >     For example, with the previous implementation, storing the value
> >     -3 would generate imm.d = 0xfffffffd due to signed integer sign
> >     extension, which is not correct. Instead, we should cast to
> >     unsigned, which gives us the correct result: imm.ud = 0xfffdfffd.
> >
> >     We only had a couple of cases hitting this path in the driver
> >     until now, one with value -1, which would work since all bits are
> >     one in this case, and another with value -2 in brw_clip_tri(),
> >     which would hit the aforementioned issue (this case only affects
> >     gen4 although we are not aware of whether this was causing an
> >     actual bug somewhere).
> >     ---
> >      src/intel/compiler/brw_reg.h | 2 +-
> >      1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >     diff --git a/src/intel/compiler/brw_reg.h
> b/src/intel/compiler/brw_reg.h
> >     index dff9b970b2..0084a78af6 100644
> >     --- a/src/intel/compiler/brw_reg.h
> >     +++ b/src/intel/compiler/brw_reg.h
> >     @@ -705,7 +705,7 @@ static inline struct brw_reg
> >      brw_imm_w(int16_t w)
> >      {
> >         struct brw_reg imm = brw_imm_reg(BRW_REGISTER_TYPE_W);
> >     -   imm.d = w | (w << 16);
> >     +   imm.ud = (uint16_t)w | ((uint16_t)w << 16);
> 
> > Uh... Is this cast right?  Doing a << 16 on a 16-bit data type should
> > yield undefined results.  I think you want a (uint32_t) cast.
> 
> In my test code it was working at least with GCC, I think it is because
> at the end we have an integer promotion for any type with lower rank
> than int.
> 
> "Formally, the rule says (C11 6.3.1.1):
> 
>     If an int can represent all values of the original type (as
> restricted by the width, for a bit-field), the value is converted to an
> int; otherwise, it is converted to an unsigned int. These are called the
> integer promotions."
> 
> But I agree that is clearer if we just use (uint32_t).
> I can change also the brw_imm_uw case that has the same issue.
> 
> 
> Yeah, best to make it clear. :-)

I was wrong: we can't just replace the (uint16_t) cast with (uint32_t),
because casting a signed short to uint32_t sign-extends the value. Sign
extension depends on the signedness of the source type, not of the
destination type.

For example, with w = -2 (0xfffe):

imm.ud = (uint32_t)w | (uint32_t)w << 16;

becomes: 0xfffffffe

The alternatives I've found that give the correct result are:

imm.ud = ((uint32_t)w & 0xffff) | (uint32_t)w << 16;

Or:

uint16_t value = w;
imm.ud = (uint32_t)value | (uint32_t)value << 16;

Or something like:

imm.ud = (uint32_t)(uint16_t)w | ((uint32_t)(uint16_t)w << 16);

Any preference?

Chema
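
The alternatives listed in this message can be checked with a small
standalone sketch. The function names are made up for the comparison and
none of this is the brw_reg.h code; each function just returns the 32-bit
value that would be stored in imm.ud.

```c
#include <stdint.h>

/* Plain (uint32_t) cast: the int16_t is sign-extended first, so the low
 * word is not isolated. For w = -2 this yields 0xfffffffe. */
static uint32_t
replicate_sign_extended(int16_t w)
{
   return (uint32_t)w | ((uint32_t)w << 16);
}

/* Mask the low word explicitly. */
static uint32_t
replicate_masked(int16_t w)
{
   return ((uint32_t)w & 0xffffu) | ((uint32_t)w << 16);
}

/* Go through a uint16_t temporary, then widen. */
static uint32_t
replicate_via_temp(int16_t w)
{
   uint16_t value = (uint16_t)w;
   return (uint32_t)value | ((uint32_t)value << 16);
}

/* Same as above, written as a double cast. */
static uint32_t
replicate_double_cast(int16_t w)
{
   return (uint32_t)(uint16_t)w | ((uint32_t)(uint16_t)w << 16);
}
```

For w = -2 the last three all give 0xfffefffe, so the choice between them
is mostly about readability.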


Re: [Mesa-dev] [PATCH v2 06/18] intel/compiler: fix brw_imm_w for negative 16-bit integers

2018-04-30 Thread Chema Casanova


On 30/04/18 23:12, Jason Ekstrand wrote:
> On Mon, Apr 30, 2018 at 7:18 AM, Iago Toral Quiroga  > wrote:
> 
> From: Jose Maria Casanova Crespo  >
> 
> 16-bit immediates need to replicate the 16-bit immediate value
> in both words of the 32-bit value. This needs to be careful
> to avoid sign-extension, which the previous implementation was
> not handling properly.
> 
> For example, with the previous implementation, storing the value
> -3 would generate imm.d = 0xfffffffd due to signed integer sign
> extension, which is not correct. Instead, we should cast to
> unsigned, which gives us the correct result: imm.ud = 0xfffdfffd.
> 
> We only had a couple of cases hitting this path in the driver
> until now, one with value -1, which would work since all bits are
> one in this case, and another with value -2 in brw_clip_tri(),
> which would hit the aforementioned issue (this case only affects
> gen4 although we are not aware of whether this was causing an
> actual bug somewhere).
> ---
>  src/intel/compiler/brw_reg.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/src/intel/compiler/brw_reg.h b/src/intel/compiler/brw_reg.h
> index dff9b970b2..0084a78af6 100644
> --- a/src/intel/compiler/brw_reg.h
> +++ b/src/intel/compiler/brw_reg.h
> @@ -705,7 +705,7 @@ static inline struct brw_reg
>  brw_imm_w(int16_t w)
>  {
>     struct brw_reg imm = brw_imm_reg(BRW_REGISTER_TYPE_W);
> -   imm.d = w | (w << 16);
> +   imm.ud = (uint16_t)w | ((uint16_t)w << 16);

> Uh... Is this cast right?  Doing a << 16 on a 16-bit data type should
> yield undefined results.  I think you want a (uint32_t) cast.

In my test code it worked, at least with GCC. I think that is because
integer promotion ends up being applied to any type with a lower rank
than int.

"Formally, the rule says (C11 6.3.1.1):

If an int can represent all values of the original type (as
restricted by the width, for a bit-field), the value is converted to an
int; otherwise, it is converted to an unsigned int. These are called the
integer promotions."

But I agree that it is clearer if we just use (uint32_t).
I can also change the brw_imm_uw case, which has the same issue.
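
A quick way to see the promotion at work is to ask the compiler for the
type of the shift expression. The sketch below assumes a C11 compiler and
an int wider than 16 bits; the helper names are hypothetical.

```c
#include <stdint.h>

/* Returns 1 if shifting a uint16_t operand yields an expression of type
 * int, which is what the integer promotions predict. */
static int
u16_shift_is_int(void)
{
   uint16_t v = 1;
   return _Generic(v << 1, int: 1, default: 0);
}

/* Returns 1 if casting to uint32_t first keeps the shift unsigned. */
static int
u32_shift_is_unsigned(void)
{
   uint16_t v = 1;
   return _Generic((uint32_t)v << 1, uint32_t: 1, default: 0);
}
```

Since the promoted operand is a signed int, a shift whose result does not
fit in int (for example 0xfffd << 16 with a 32-bit int) is undefined
according to C11 6.5.7, even if GCC produces the expected bits in practice.
Casting to uint32_t before shifting avoids that.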

>     return imm;
>  }
>  
> -- 
> 2.14.1
> 
> 


Re: [Mesa-dev] [PATCH 08/11] intel/compiler: fix brw_negate_immediate for 16-bit types

2018-04-26 Thread Chema Casanova
El 24/04/18 a las 23:55, Jason Ekstrand escribió:
> On Wed, Apr 11, 2018 at 12:20 AM, Iago Toral Quiroga  > wrote:
> 
> From: Jose Maria Casanova Crespo  >
> 
> 16-bit immediates are replicated in each word of a 32-bit value
> so we need to negate both.
> ---
>  src/intel/compiler/brw_shader.cpp | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_shader.cpp
> b/src/intel/compiler/brw_shader.cpp
> index 9cdf9fcb23d..c7edc60b63d 100644
> --- a/src/intel/compiler/brw_shader.cpp
> +++ b/src/intel/compiler/brw_shader.cpp
> @@ -581,7 +581,8 @@ brw_negate_immediate(enum brw_reg_type type,
> struct brw_reg *reg)
>        return true;
>     case BRW_REGISTER_TYPE_W:
>     case BRW_REGISTER_TYPE_UW:
> -      reg->d = -(int16_t)reg->ud;
> +   case BRW_REGISTER_TYPE_HF:
> +      reg->ud ^= 0x80008000;
> 
> 
> This is not correct for integers.  We need to keep two separate cases.

That's true, I had forgotten about two's complement representation. For
v2 of this series I will send:

case BRW_REGISTER_TYPE_UW:
-  reg->d = -(int16_t)reg->ud;
+  uint16_t value = -(int16_t)reg->ud;
+  reg->ud = value | value << 16;
+  return true;
+   case BRW_REGISTER_TYPE_HF:
+  reg->ud ^= 0x80008000;

For v2 I'm also including a new patch that fixes a problem with
negative values in brw_imm_w, which replicates the 16-bit value
incorrectly because of sign extension.

>  
> 
>        return true;
>     case BRW_REGISTER_TYPE_F:
>        reg->f = -reg->f;
> @@ -602,8 +603,6 @@ brw_negate_immediate(enum brw_reg_type type,
> struct brw_reg *reg)
>     case BRW_REGISTER_TYPE_UV:
>     case BRW_REGISTER_TYPE_V:
>        assert(!"unimplemented: negate UV/V immediate");
> -   case BRW_REGISTER_TYPE_HF:
> -      assert(!"unimplemented: negate HF immediate");
>     case BRW_REGISTER_TYPE_NF:
>        unreachable("no NF immediates");
>     }
> -- 
> 2.14.1
> 


Re: [Mesa-dev] [PATCH 06/11] nir/constant_folding: support 16-bit constants

2018-04-25 Thread Chema Casanova
El 24/04/18 a las 23:52, Jason Ekstrand escribió:
> On Wed, Apr 11, 2018 at 12:20 AM, Iago Toral Quiroga  > wrote:
> 
> From: Jose Maria Casanova Crespo  >
> 
> ---
>  src/compiler/nir/nir_opt_constant_folding.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/src/compiler/nir/nir_opt_constant_folding.c
> b/src/compiler/nir/nir_opt_constant_folding.c
> index d6be807b3dc..b63660ea4da 100644
> --- a/src/compiler/nir/nir_opt_constant_folding.c
> +++ b/src/compiler/nir/nir_opt_constant_folding.c
> @@ -78,6 +78,8 @@ constant_fold_alu_instr(nir_alu_instr *instr, void
> *mem_ctx)
>             j++) {
>           if (load_const->def.bit_size == 64)
>              src[i].u64[j] =
> load_const->value.u64[instr->src[i].swizzle[j]];
> +         else if (load_const->def.bit_size == 16)
> +            src[i].u16[j] =
> load_const->value.u16[instr->src[i].swizzle[j]];
>           else
>              src[i].u32[j] =
> load_const->value.u32[instr->src[i].swizzle[j]];
> 
> 
> Let's make this a switch and support 8 while we're at it.

Karol Herbst has just sent a patch with these changes done. I've
reviewed it, since I had arrived at the same patch.

> 
>        }
> -- 
> 2.14.1
> 


Re: [Mesa-dev] [PATCH 3/3] nir/opt_constant_folding: fix folding of 8 and 16 bit ints

2018-04-25 Thread Chema Casanova
I had already arrived at the same code while addressing Jason's feedback on
"[PATCH 06/11] nir/constant_folding: support 16-bit constants."

So this is:

Reviewed-by: Jose Maria Casanova Crespo 

On 25/04/18 at 11:14, Karol Herbst wrote:
> Signed-off-by: Karol Herbst 
> ---
>  src/compiler/nir/nir_opt_constant_folding.c | 14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/src/compiler/nir/nir_opt_constant_folding.c 
> b/src/compiler/nir/nir_opt_constant_folding.c
> index d6be807b3dc..a848b145874 100644
> --- a/src/compiler/nir/nir_opt_constant_folding.c
> +++ b/src/compiler/nir/nir_opt_constant_folding.c
> @@ -76,10 +76,20 @@ constant_fold_alu_instr(nir_alu_instr *instr, void 
> *mem_ctx)
>  
>for (unsigned j = 0; j < nir_ssa_alu_instr_src_components(instr, i);
> j++) {
> - if (load_const->def.bit_size == 64)
> + switch(load_const->def.bit_size) {
> + case 64:
>  src[i].u64[j] = load_const->value.u64[instr->src[i].swizzle[j]];
> - else
> +break;
> + case 32:
>  src[i].u32[j] = load_const->value.u32[instr->src[i].swizzle[j]];
> +break;
> + case 16:
> +src[i].u16[j] = load_const->value.u16[instr->src[i].swizzle[j]];
> +break;
> + case 8:
> +src[i].u8[j] = load_const->value.u8[instr->src[i].swizzle[j]];
> +break;
> + }
>}
>  
>/* We shouldn't have any source modifiers in the optimization loop. */
> 


Re: [Mesa-dev] [PATCH 1/3] nir: support converting to 8-bit integers in nir_type_conversion_op

2018-04-25 Thread Chema Casanova
Reviewed-by: Jose Maria Casanova Crespo 

On 25/04/18 at 11:14, Karol Herbst wrote:
> Signed-off-by: Karol Herbst 
> ---
>  src/compiler/nir/nir_opcodes_c.py | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/src/compiler/nir/nir_opcodes_c.py 
> b/src/compiler/nir/nir_opcodes_c.py
> index c19185534af..8afccca9504 100644
> --- a/src/compiler/nir/nir_opcodes_c.py
> +++ b/src/compiler/nir/nir_opcodes_c.py
> @@ -62,7 +62,12 @@ nir_type_conversion_op(nir_alu_type src, nir_alu_type dst, 
> nir_rounding_mode rnd
>  % endif
>  %  endif
> switch (dst_bit_size) {
> -% for dst_bits in [16, 32, 64]:
> +% if dst_t == 'float':
> +<%bit_sizes = [16, 32, 64] %>
> +% else:
> +<%bit_sizes = [8, 16, 32, 64] %>
> +% endif
> +% for dst_bits in bit_sizes:
>case ${dst_bits}:
>  %if src_t == 'float' and dst_t == 'float' and dst_bits 
> == 16:
>   switch(rnd) {
> 


Re: [Mesa-dev] [PATCH] i965/fs: retype offset_reg to UD at load_ssbo

2018-04-19 Thread Chema Casanova
On 19/04/18 19:50, Ian Romanick wrote:
> On 04/18/2018 01:57 PM, Jose Maria Casanova Crespo wrote:
>> All operations with offset_reg at do_vector_read are done
>> with UD type. So copy propagation was not working through
>> the generated MOVs:
>>
>> mov(8) vgrf9:UD, vgrf7:D
> 
> I have noticed other cases of this.  Copy propagation doesn't work
> across moves that change the type because int->float and float->int
> actually change the bit pattern.  unsigned->signed doesn't do anything,
> so it seems like we should allow that case.  This is a few steps down on
> my to-do list, but if someone else gets to it first...

I have a pending visit to copy propagation because of some strange
16-bit behaviours, so I can also put this on my to-do list and check
that allowing it doesn't cause any problems.
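
As a rough illustration of Ian's point, the sketch below compares the bits that result from a converting int->float move with those from a signed->unsigned move. It is a CPU-side model only, with made-up helper names, and it assumes IEEE 754 single-precision floats; it is not compiler code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bits held by the destination after a D -> F move: the value is converted,
 * so the bit pattern differs from the source. */
static uint32_t
bits_after_int_to_float(int32_t d)
{
   const float f = (float)d;
   uint32_t bits;
   assert(sizeof(bits) == sizeof(f));
   memcpy(&bits, &f, sizeof(bits));
   return bits;
}

/* Bits held by the destination after a D -> UD move: the register contents
 * are copied unchanged and only the interpretation differs. */
static uint32_t
bits_after_int_to_uint(int32_t d)
{
   uint32_t bits;
   memcpy(&bits, &d, sizeof(bits));
   return bits;
}
```

For d = 1 the first helper returns 0x3F800000 and the second returns 0x00000001, which is why propagating through the D/UD move looks safe while propagating through the D/F move is not.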

Chema

>> This change allows removing the MOV generated for reading the
>> first components for 16-bit and 64-bit ssbo reads with
>> non-constant offsets.
>> ---
>>  src/intel/compiler/brw_fs_nir.cpp | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/src/intel/compiler/brw_fs_nir.cpp 
>> b/src/intel/compiler/brw_fs_nir.cpp
>> index 6c4bcd1c113..0ebaab96634 100644
>> --- a/src/intel/compiler/brw_fs_nir.cpp
>> +++ b/src/intel/compiler/brw_fs_nir.cpp
>> @@ -4142,7 +4142,7 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , 
>> nir_intrinsic_instr *instr
>>if (const_offset) {
>>   offset_reg = brw_imm_ud(const_offset->u32[0]);
>>} else {
>> - offset_reg = get_nir_src(instr->src[1]);
>> + offset_reg = retype(get_nir_src(instr->src[1]), 
>> BRW_REGISTER_TYPE_UD);
>>}
>>  
>>/* Read the vector */
>>
> 
> 


Re: [Mesa-dev] [PATCH 1/2] intel/compiler: grf127 can not be dest when src and dest overlap in send

2018-04-16 Thread Chema Casanova
On 15/04/18 08:55, Matt Turner wrote:
> On Wed, Apr 11, 2018 at 7:30 PM, Jose Maria Casanova Crespo
>  wrote:
>> Implement at brw_eu_validate the restriction from Intel Broadwell PRM, vol 
>> 07,
>> section "Instruction Set Reference", subsection "EUISA Instructions", Send
>> Message (page 990):
>>
>> "r127 must not be used for return address when there is a src and dest 
>> overlap
>> in send instruction."
>>
>> Cc: Jason Ekstrand 
>> Cc: Matt Turner 
>> ---
>>  src/intel/compiler/brw_eu_validate.c | 9 +
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/src/intel/compiler/brw_eu_validate.c 
>> b/src/intel/compiler/brw_eu_validate.c
>> index d3189d1ef5e..0d711501303 100644
>> --- a/src/intel/compiler/brw_eu_validate.c
>> +++ b/src/intel/compiler/brw_eu_validate.c
>> @@ -261,6 +261,15 @@ send_restrictions(const struct gen_device_info *devinfo,
>>brw_inst_src0_da_reg_nr(devinfo, inst) < 112,
>>"send with EOT must use g112-g127");
>>}
> 
> Put a newline here

Fixed locally.

>> +  if (devinfo->gen >= 8) {
>> + ERROR_IF(!dst_is_null(devinfo, inst) &&
>> +  (brw_inst_dst_da_reg_nr(devinfo, inst) +
>> +   brw_inst_rlen(devinfo, inst) > 127 ) &&
> 
> Remove the extra space after 127

Fixed locally.

>> +  (brw_inst_src0_da_reg_nr(devinfo, inst) +
>> +   brw_inst_mlen(devinfo, inst) >
>> +   brw_inst_dst_da_reg_nr(devinfo, inst)),
>> +  "r127 can not be dest when src and dest overlap in send");
> 
> I'd change the message to more closely match the docs:
> 
> "r127 must not be used for return address when there is a src and dest 
> overlap"
> 
> Thank you for extending the validator!
> 
> Reviewed-by: Matt Turner 

Thank you for the review.
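
For reference, the condition in the posted hunk can be modelled with plain integers as below. This is only a sketch: the function and parameter names are hypothetical stand-ins for the brw_inst_* accessors, and it mirrors the patch as posted here rather than whatever finally lands.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical standalone model of the check: the send is flagged when the
 * destination is not null, the returned registers reach past r127, and the
 * message payload (src0 .. src0 + mlen) extends beyond the first
 * destination register. */
static bool
send_r127_violation(bool dst_is_null, unsigned dst_nr, unsigned rlen,
                    unsigned src0_nr, unsigned mlen)
{
   assert(dst_nr <= 127 && src0_nr <= 127);
   return !dst_is_null &&
          dst_nr + rlen > 127 &&
          src0_nr + mlen > dst_nr;
}
```

For example, dst = g126 with rlen = 2 and src0 = g125 with mlen = 2 is flagged, while the same send with a null destination is not.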

Chema


Re: [Mesa-dev] [PATCH 1/6] i965: Add negative_equals methods

2018-03-23 Thread Chema Casanova


On 23/03/18 19:27, Matt Turner wrote:
> On Wed, Mar 21, 2018 at 5:58 PM, Ian Romanick  wrote:
>> From: Ian Romanick 
>>
>> This method is similar to the existing ::equals methods.  Instead of
>> testing that two src_regs are equal to each other, it tests that one is
>> the negation of the other.
>>
>> v2: Simplify various checks based on suggestions from Matt.  Use
>> src_reg::type instead of fixed_hw_reg.type in a check.  Also suggested
>> by Matt.
>>
>> v3: Rebase on 3 years.  Fix some problems with negative_equals with VF
>> constants.  Add fs_reg::negative_equals.
>>
>> Signed-off-by: Ian Romanick 
>> ---
>>  src/intel/compiler/brw_fs.cpp |  7 ++
>>  src/intel/compiler/brw_ir_fs.h|  1 +
>>  src/intel/compiler/brw_ir_vec4.h  |  1 +
>>  src/intel/compiler/brw_reg.h  | 46 
>> +++
>>  src/intel/compiler/brw_shader.cpp |  6 +
>>  src/intel/compiler/brw_shader.h   |  1 +
>>  src/intel/compiler/brw_vec4.cpp   |  7 ++
>>  7 files changed, 69 insertions(+)
>>
>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
>> index 6eea532..3d454c3 100644
>> --- a/src/intel/compiler/brw_fs.cpp
>> +++ b/src/intel/compiler/brw_fs.cpp
>> @@ -454,6 +454,13 @@ fs_reg::equals(const fs_reg ) const
>>  }
>>
>>  bool
>> +fs_reg::negative_equals(const fs_reg ) const
>> +{
>> +   return (this->backend_reg::negative_equals(r) &&
>> +   stride == r.stride);
>> +}
>> +
>> +bool
>>  fs_reg::is_contiguous() const
>>  {
>> return stride == 1;
>> diff --git a/src/intel/compiler/brw_ir_fs.h b/src/intel/compiler/brw_ir_fs.h
>> index 54797ff..f06a33c 100644
>> --- a/src/intel/compiler/brw_ir_fs.h
>> +++ b/src/intel/compiler/brw_ir_fs.h
>> @@ -41,6 +41,7 @@ public:
>> fs_reg(enum brw_reg_file file, int nr, enum brw_reg_type type);
>>
>> bool equals(const fs_reg ) const;
>> +   bool negative_equals(const fs_reg ) const;
>> bool is_contiguous() const;
>>
>> /**
>> diff --git a/src/intel/compiler/brw_ir_vec4.h 
>> b/src/intel/compiler/brw_ir_vec4.h
>> index cbaff2f..95c5119 100644
>> --- a/src/intel/compiler/brw_ir_vec4.h
>> +++ b/src/intel/compiler/brw_ir_vec4.h
>> @@ -43,6 +43,7 @@ public:
>> src_reg(struct ::brw_reg reg);
>>
>> bool equals(const src_reg ) const;
>> +   bool negative_equals(const src_reg ) const;
>>
>> src_reg(class vec4_visitor *v, const struct glsl_type *type);
>> src_reg(class vec4_visitor *v, const struct glsl_type *type, int size);
>> diff --git a/src/intel/compiler/brw_reg.h b/src/intel/compiler/brw_reg.h
>> index 7ad144b..732bddf 100644
>> --- a/src/intel/compiler/brw_reg.h
>> +++ b/src/intel/compiler/brw_reg.h
>> @@ -255,6 +255,52 @@ brw_regs_equal(const struct brw_reg *a, const struct 
>> brw_reg *b)
>> return a->bits == b->bits && (df ? a->u64 == b->u64 : a->ud == b->ud);
>>  }
>>
>> +static inline bool
>> +brw_regs_negative_equal(const struct brw_reg *a, const struct brw_reg *b)
>> +{
>> +   if (a->file == IMM) {
>> +  if (a->bits != b->bits)
>> + return false;
>> +
>> +  switch (a->type) {
>> +  case BRW_REGISTER_TYPE_UQ:
>> +  case BRW_REGISTER_TYPE_Q:
>> + return a->d64 == -b->d64;
>> +  case BRW_REGISTER_TYPE_DF:
>> + return a->df == -b->df;
>> +  case BRW_REGISTER_TYPE_UD:
>> +  case BRW_REGISTER_TYPE_D:
>> + return a->d == -b->d;
>> +  case BRW_REGISTER_TYPE_F:
>> + return a->f == -b->f;
>> +  case BRW_REGISTER_TYPE_VF:
>> + /* It is tempting to treat 0 as a negation of 0 (and -0 as a 
>> negation
>> +  * of -0).  There are occasions where 0 or -0 is used and the exact
>> +  * bit pattern is desired.  At the very least, changing this to 
>> allow
>> +  * 0 as a negation of 0 causes some fp64 tests to fail on IVB.
>> +  */
>> + return a->ud == (b->ud ^ 0x80808080);
>> +  case BRW_REGISTER_TYPE_UW:
>> +  case BRW_REGISTER_TYPE_W:
>> +  case BRW_REGISTER_TYPE_UV:
>> +  case BRW_REGISTER_TYPE_V:
>> +  case BRW_REGISTER_TYPE_HF:
>> +  case BRW_REGISTER_TYPE_UB:
>> +  case BRW_REGISTER_TYPE_B:
> 
> There are no B/UB immediates, so you can move these to default. In
> fact, I'd get rid of the default so we'll get a warning if there are
> unhandled types. Probably the only one not already in the list is NF,
> which should also be unreachable.

> Returning false for unimplemented types seems fine. Immediates of
> those types are sufficiently rare that I don't expect this to catch
> anything, and in the rare occurrence that it does I wouldn't want the
> compiler to assert fail or do something undefined. Really I only
> expect HF to ever get hit, and only after we actually start using it.

According to PRM:

"For a word, unsigned word, or half-float immediate data, software
must replicate the same 16-bit immediate value to both the lower word
and the high word of the 

Re: [Mesa-dev] [PATCH] nir/search: Support 8 and 16-bit constants

2018-03-01 Thread Chema Casanova
I've been checking the whole of nir_search.c and there is another place
where 16-bit support is still pending, in the construct_value function.
I'm sending a patch, so feel free to squash it into yours if it makes sense.

In any case, this is:

Reviewed-by: Jose Maria Casanova Crespo 

On 28/02/18 at 22:18, Jason Ekstrand wrote:
> ---
>  src/compiler/nir/nir_search.c | 20 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/src/compiler/nir/nir_search.c b/src/compiler/nir/nir_search.c
> index dec56fe..c7c52ae 100644
> --- a/src/compiler/nir/nir_search.c
> +++ b/src/compiler/nir/nir_search.c
> @@ -27,6 +27,7 @@
>  
>  #include 
>  #include "nir_search.h"
> +#include "util/half_float.h"
>  
>  struct match_state {
> bool inexact_match;
> @@ -194,6 +195,9 @@ match_value(const nir_search_value *value, nir_alu_instr 
> *instr, unsigned src,
>   for (unsigned i = 0; i < num_components; ++i) {
>  double val;
>  switch (load->def.bit_size) {
> +case 16:
> +   val = _mesa_half_to_float(load->value.u16[new_swizzle[i]]);
> +   break;
>  case 32:
> val = load->value.f32[new_swizzle[i]];
> break;
> @@ -213,6 +217,22 @@ match_value(const nir_search_value *value, nir_alu_instr 
> *instr, unsigned src,
>case nir_type_uint:
>case nir_type_bool32:
>   switch (load->def.bit_size) {
> + case 8:
> +for (unsigned i = 0; i < num_components; ++i) {
> +   if (load->value.u8[new_swizzle[i]] !=
> +   (uint8_t)const_val->data.u)
> +  return false;
> +}
> +return true;
> +
> + case 16:
> +for (unsigned i = 0; i < num_components; ++i) {
> +   if (load->value.u16[new_swizzle[i]] !=
> +   (uint16_t)const_val->data.u)
> +  return false;
> +}
> +return true;
> +
>   case 32:
>  for (unsigned i = 0; i < num_components; ++i) {
> if (load->value.u32[new_swizzle[i]] !=
> 


Re: [Mesa-dev] [PATCH v2 0/8] anv: VK_KHR_16bit_storage enabling SSBO/UBO/PushConstant

2018-02-28 Thread Chema Casanova
On 28/02/18 18:02, Jason Ekstrand wrote:
> I think all the interesting patches are reviewed now.  All the boring
> "turn stuff on" patches are
> 
> Reviewed-by: Jason Ekstrand  >

Thanks a lot for the quick review of the series. One step closer to
enabling the VK_KHR_16bit_storage features completely.

Chema

> On Tue, Feb 27, 2018 at 5:27 AM, Jose Maria Casanova Crespo
> > wrote:
> 
> This v2 series includes several fixes to allow enabling the
> VK_KHR_16bit_storage
> features in ANV that have already landed but are currently disabled.
> 
> Main differences with V1 [1] are:
> 
>    * Now UBO/SSBO padding for buffers size not multiple of 4 [1/8]
> is done
>      in isl and the calculus to get the original buffer size before
>      padding is done in the backend.
>    * Now load_ubo/ssbo [3/8] at constant offsets use
> untyped_surface_read
>      in all cases. A new patch [2/8] enables the
>      shuffle_32bit_load_result_to_16bit_data to skip components.
>    * vtn_type_block_size [6/8] has been simplified using
> glsl_get_bit_size.
> 
> Patches 2 and 3 and the re-enablement of features 5 and 8 are the
> ones with
> pending review.
> 
> The series includes the following fixes:
> 
>    * [1] Fixes issues in UBO/SSBO support when the buffer size is not a
>      multiple of 4. The patch adds padding so the size always includes
>      the last DWord completely. For unsized SSBO arrays there is some bit
>      arithmetic that allows recalculating the original size without the
>      padding, so the number of elements is calculated correctly.
>    * [2-4] Fixes the behaviour when VK_KHR_relaxed_block_layout is
> enabled, when
>      we can not guarantee that the surface read/write offsets are
> multiple of 4.
>    * [5] Enables VK_KHR_16bit_storage for SSBO and UBO.
>    * [6-8] Enables 16-bit push constants by removing/changing asserts
>      that no longer apply to the 16-bit case, plus a fix in the
>      calculation of the size to be read.
> 
> To catch these issues several new tests were developed, and they will
> be included upstream in VK-GL-CTS.
> 
> This new version of this fixup series creates some conflicts in the
> re-submitted
> V5 series with the 16-bit Input/Output support that is still pending
> of review.
> An updated version including both series has been force-pushed at [2]
> 
> [1]
> https://lists.freedesktop.org/archives/mesa-dev/2018-February/186544.html
> 
> 
> [2] https://github.com/Igalia/mesa/tree/wip/VK_KHR_16bit_storage-rc5
> 
> 
> Cc: Jason Ekstrand  >
> 
> Jose Maria Casanova Crespo (8):
>   isl/i965/fs: SSBO/UBO buffers need size padding if not multiple of
>     32-bit
>   i965/fs: shuffle_32bit_load_result_to_16bit_data now skips components
>   i965/fs: Support 16-bit do_read_vector with
>     VK_KHR_relaxed_block_layout
>   i965/fs: Support 16-bit store_ssbo with VK_KHR_relaxed_block_layout
>   anv: Enable VK_KHR_16bit_storage for SSBO and UBO
>   spirv: Calculate properly 16-bit vector sizes
>   spirv/i965/anv: Relax push constant offset assertions being 32-bit
>     aligned
>   anv: Enable VK_KHR_16bit_storage for PushConstant
> 
>  src/compiler/spirv/vtn_variables.c              |   9 +-
>  src/intel/compiler/brw_fs.cpp                   |   2 +-
>  src/intel/compiler/brw_fs.h                     |   3 +-
>  src/intel/compiler/brw_fs_nir.cpp               | 124
> ++--
>  src/intel/isl/isl_surface_state.c               |  22 -
>  src/intel/vulkan/anv_device.c                   |  18 +++-
>  src/intel/vulkan/anv_extensions.py              |   2 +-
>  src/intel/vulkan/anv_nir_lower_push_constants.c |   2 -
>  8 files changed, 137 insertions(+), 45 deletions(-)
> 
> --
> 2.14.3
> 


Re: [Mesa-dev] [PATCH v2 2/8] i965/fs: shuffle_32bit_load_result_to_16bit_data now skips components

2018-02-28 Thread Chema Casanova
On 27/02/18 20:00, Jason Ekstrand wrote:
> On Tue, Feb 27, 2018 at 5:27 AM, Jose Maria Casanova Crespo
> > wrote:
> 
> This helper used to load 16bit components from 32-bits read now allows
> skipping components with the new parameter first_component. The
> semantics
> now skip components until we reach the first_component, and then
> reads the
> number of components passed to the function.
> 
> All previous uses of the helper are updated to use 0 as first_component.
> This will allow reading 16-bit components when the first one is not
> 32-bit aligned, enabling more uses of untyped_reads with 16-bit types.
> ---
>  src/intel/compiler/brw_fs.cpp     | 2 +-
>  src/intel/compiler/brw_fs.h       | 3 ++-
>  src/intel/compiler/brw_fs_nir.cpp | 8 +---
>  3 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs.cpp
> b/src/intel/compiler/brw_fs.cpp
> index bed632d21b9..e961b76ab61 100644
> --- a/src/intel/compiler/brw_fs.cpp
> +++ b/src/intel/compiler/brw_fs.cpp
> @@ -194,7 +194,7 @@ fs_visitor::VARYING_PULL_CONSTANT_LOAD(const
> fs_builder ,
>     fs_reg dw = offset(vec4_result, bld, (const_offset & 0xf) / 4);
>     switch (type_sz(dst.type)) {
>     case 2:
> -      shuffle_32bit_load_result_to_16bit_data(bld, dst, dw, 1);
> +      shuffle_32bit_load_result_to_16bit_data(bld, dst, dw, 1, 0);
>        bld.MOV(dst, subscript(dw, dst.type, (const_offset / 2) & 1));
>        break;
>     case 4:
> diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h
> index 63373580ee4..52dd5e1d6bb 100644
> --- a/src/intel/compiler/brw_fs.h
> +++ b/src/intel/compiler/brw_fs.h
> @@ -503,7 +503,8 @@ fs_reg shuffle_64bit_data_for_32bit_write(const
> brw::fs_builder ,
>  void shuffle_32bit_load_result_to_16bit_data(const brw::fs_builder
> ,
>                                               const fs_reg ,
>                                               const fs_reg ,
> -                                             uint32_t components);
> +                                             uint32_t components,
> +                                             uint32_t first_component);
> 
> 
> I hope this doesn't cause too much trouble, but I think it would be
> better to have first_component come before components in the argument
> list.  With that changed,

Not too much; it only affected 2 patches in the input/output series that
were already touched by this patch. Everything is already changed locally.
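
To make the indexing concrete, here is a CPU-side sketch of the component selection, with first_component placed before components as suggested. The function name is made up for the example, and it assumes that 16-bit half 0 of a dword is the low word; it only models the (first_component + i) / 2 and (first_component + i) % 2 arithmetic, not the register shuffling itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: component k of the 16-bit data lives in dword k / 2,
 * in the low half when k is even and in the high half when k is odd. The
 * first first_component components are skipped. */
static void
extract_16bit_components(const uint32_t *src, uint16_t *dst,
                         uint32_t first_component, uint32_t components)
{
   assert(src != NULL && dst != NULL);

   for (uint32_t i = 0; i < components; i++) {
      const uint32_t k = first_component + i;
      dst[i] = (uint16_t)(src[k / 2] >> (16 * (k % 2)));
   }
}
```

Reading 2 components starting at component 1 from the dwords { 0x22221111, 0x44443333 } gives 0x2222 and 0x3333, which is the case where the first component is not 32-bit aligned.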

> Reviewed-by: Jason Ekstrand  >

Thanks,

Chema

>  void shuffle_16bit_data_for_32bit_write(const brw::fs_builder ,
>                                          const fs_reg ,
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 4aa411d149f..5567433a19e 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -2316,7 +2316,7 @@ do_untyped_vector_read(const fs_builder ,
>           shuffle_32bit_load_result_to_16bit_data(bld,
>                 retype(dest, BRW_REGISTER_TYPE_W),
>                 retype(read_result, BRW_REGISTER_TYPE_D),
> -               num_components);
> +               num_components, 0);
>        } else {
>           assert(num_components == 1);
>           /* scalar 16-bit are read using one byte_scattered_read
> message */
> @@ -4908,7 +4908,8 @@ void
>  shuffle_32bit_load_result_to_16bit_data(const fs_builder ,
>                                          const fs_reg ,
>                                          const fs_reg ,
> -                                        uint32_t components)
> +                                        uint32_t components,
> +                                        uint32_t first_component)
>  {
>     assert(type_sz(src.type) == 4);
>     assert(type_sz(dst.type) == 2);
> @@ -4922,7 +4923,8 @@ shuffle_32bit_load_result_to_16bit_data(const
> fs_builder ,
> 
>     for (unsigned i = 0; i < components; i++) {
>        const fs_reg component_i =
> -         subscript(offset(src, bld, i / 2), dst.type, i % 2);
> +         subscript(offset(src, bld, (first_component + i) / 2),
> dst.type,
> +                   (first_component + i) % 2);
> 
>        bld.MOV(offset(tmp, bld, i % 2), component_i);
> 
> --
> 2.14.3
> 

Re: [Mesa-dev] [PATCH v2 1/8] isl/i965/fs: SSBO/UBO buffers need size padding if not multiple of 32-bit

2018-02-28 Thread Chema Casanova
On 27/02/18 19:53, Jason Ekstrand wrote:
> On Tue, Feb 27, 2018 at 5:27 AM, Jose Maria Casanova Crespo
> > wrote:
> 
> The surfaces that back the GPU buffers have a bounds check that
> treats any access to a partial dword as out-of-bounds.
> For example, buffers with 1 or 3 16-bit elements have size 2 or 6, and
> the last two bytes would always be read as 0 or their writes ignored.
> 
> The introduction of 16-bit types implies that we need to align the size
> to 4-byte multiples so that partial dwords can be read/written.
> Adding an unconditional +2 size to buffers whose size is not a multiple
> of 2 solves this issue for the general cases of UBO or SSBO.
> 
> But, when unsized arrays of 16-bit elements are used it is not possible
> to know if the size was padded or not. To solve this issue the
> implementation calculates the needed size of the buffer surfaces,
> as suggested by Jason:
> 
> surface_size = isl_align(buffer_size, 4) +
>                (isl_align(buffer_size, 4) - buffer_size)
> 
> So when we calculate backwards the buffer_size in the backend we
> update the resinfo return value with:
> 
> buffer_size = (surface_size & ~3) - (surface_size & 3)
> 
> These buffer requirements are also exposed when robust buffer access
> is enabled, so these buffer sizes are recommended to be a multiple of 4.
> 
> v2: (Jason Ekstrand)
>     Move padding logic from anv to isl_surface_state
>     Move calculation of original size from spirv to driver backend
> v3: (Jason Ekstrand)
>     Rename some variables and use a similar expression when calculating
>     padding as when obtaining the original buffer size.
>     Avoid use of unnecessary component call at brw_fs_nir.
> 
> Reviewed-by: Jason Ekstrand  >
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 27 ++-
>  src/intel/isl/isl_surface_state.c | 22 +-
>  src/intel/vulkan/anv_device.c     | 11 +++
>  3 files changed, 58 insertions(+), 2 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 8efec34cc9d..4aa411d149f 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -4290,7 +4290,32 @@ fs_visitor::nir_emit_intrinsic(const
> fs_builder , nir_intrinsic_instr *instr
>        inst->mlen = 1;
>        inst->size_written = 4 * REG_SIZE;
> 
> -      bld.MOV(retype(dest, ret_payload.type),
> component(ret_payload, 0));
> +      /* SKL PRM, vol07, 3D Media GPGPU Engine, Bounds Checking and
> Faulting:
> +       *
> +       * "Out-of-bounds checking is always performed at a DWord
> granularity. If
> +       * any part of the DWord is out-of-bounds then the whole DWord is
> +       * considered out-of-bounds."
> +       *
> +       * This implies that types with size smaller than 4-bytes
> need to be
> +       * padded if they don't complete the last dword of the
> buffer. But as we
> +       * need to maintain the original size we need to reverse the
> padding
> +       * calculation to return the correct size to know the number
> of elements
> +       * of an unsized array. As we stored in the last two bits of
> the size of
> +       * the buffer the needed padding we calculate here:
> +       *
> +       * buffer_size = resinfo_size & ~3 - resinfo_size & 3
> 
> 
> Mind putting both calculations in this comment like you do in the one below?

Locally changed as:

 * As we stored the needed padding for the buffer in the last two bits
 * of the surface size, we calculate here the original buffer_size by
 * reversing the surface_size calculation:
 *
 * surface_size = isl_align(buffer_size, 4) +
 *                (isl_align(buffer_size, 4) - buffer_size)
 *
 * buffer_size = (surface_size & ~3) - (surface_size & 3)

I used the same names as in the isl comment, so surface_size instead of
resinfo_size.
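
A small round-trip sketch of the two formulas, to show that the padding stored in the low two bits can be recovered. The function names are invented for the example and align4 stands in for isl_align(x, 4); note that a literal C translation needs the parentheses, since binary minus binds tighter than &.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for isl_align(v, 4): round up to the next multiple of 4. */
static uint64_t
align4(uint64_t v)
{
   return (v + 3) & ~UINT64_C(3);
}

/* surface_size = align(buffer_size, 4) + (align(buffer_size, 4) - buffer_size)
 * The padding (0..3 bytes) ends up in the two low bits of the result. */
static uint64_t
surface_size_from_buffer_size(uint64_t buffer_size)
{
   const uint64_t aligned = align4(buffer_size);
   assert(aligned >= buffer_size);
   return aligned + (aligned - buffer_size);
}

/* buffer_size = (surface_size & ~3) - (surface_size & 3) */
static uint64_t
buffer_size_from_surface_size(uint64_t surface_size)
{
   return (surface_size & ~UINT64_C(3)) - (surface_size & 3);
}
```

For a buffer of three 16-bit elements (6 bytes) the surface size becomes 8 + 2 = 10, and reversing it gives 8 - 2 = 6 again; sizes that are already a multiple of 4 are left unchanged.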

> rb still applies

Thanks.

> +       */
> +
> +      fs_reg size_aligned4 = ubld.vgrf(BRW_REGISTER_TYPE_UD);
> +      fs_reg size_padding = ubld.vgrf(BRW_REGISTER_TYPE_UD);
> +      fs_reg buffer_size = ubld.vgrf(BRW_REGISTER_TYPE_UD);
> +
> +      ubld.AND(size_padding, ret_payload, brw_imm_ud(3));
> +      ubld.AND(size_aligned4, ret_payload, brw_imm_ud(~3));
> +      ubld.ADD(buffer_size, size_aligned4, negate(size_padding));
> +
> +      bld.MOV(retype(dest, ret_payload.type),
> component(buffer_size, 0));
> +
>        brw_mark_surface_used(prog_data, index);
>        break;
>     }
> diff --git a/src/intel/isl/isl_surface_state.c
> b/src/intel/isl/isl_surface_state.c
> index bfb27fa4a44..c205b3d2c0b 

Re: [Mesa-dev] [PATCH 1/7] isl/i965/fs: SSBO/UBO buffers need size padding if not multiple of 32-bit (v2)

2018-02-26 Thread Chema Casanova
On 26/02/18 18:20, Jason Ekstrand wrote:
> I've lost track of what's reviewed and what's not.  Could you either
> just send a status list or do a resend once all the current comments are
> handled?

Reviewed-by and all feedback addressed
--
[1/7] anv/spirv: SSBO/UBO buffers needs padding size is not multiple of
32-bits.
[3/7] i965/fs: Support 16-bit store_ssbo with VK_KHR_relaxed_block_layout
[5/7] spirv: Calculate properly 16-bit vector sizes
[6/7] spirv/i965/anv: Relax push constant offset assertions being 32-bit
aligned

Feedback received but pending to finish v2
--
[2/7] i965/fs: Support 16-bit do_read_vector with
VK_KHR_relaxed_block_layout

To review when everything is ready
--
[4/7] anv: Enable VK_KHR_16bit_storage for SSBO and UBO
[7/7] anv: Enable VK_KHR_16bit_storage for PushConstant

I'll resend this series when [2/7] is ready.

Chema

> --Jason
> 
> 
> On February 26, 2018 09:08:01 Chema Casanova <jmcasan...@igalia.com> wrote:
> 
>> On 26/02/18 16:54, Jason Ekstrand wrote:
>>> On Mon, Feb 26, 2018 at 6:14 AM, Jose Maria Casanova Crespo
>>> <jmcasan...@igalia.com <mailto:jmcasan...@igalia.com>> wrote:
>>>
>>>     The surfaces that back the GPU buffers have a bounds check that
>>>     treats any access to a partial dword as out-of-bounds.
>>>     For example, buffers with 1 or 3 16-bit elements have size 2 or 6, and
>>>     the last two bytes would always be read as 0 or their writes ignored.
>>>
>>>     The introduction of 16-bit types implies that we need to align the
>>>     size to 4-byte multiples so that partial dwords can be read/written.
>>>     Adding an unconditional +2 size to buffers whose size is not a
>>>     multiple of 2 solves this issue for the general cases of UBO or SSBO.
>>>
>>>     But, when unsized arrays of 16-bit elements are used it is not
>>> possible
>>>     to know if the size was padded or not. To solve this issue the
>>>     implementation calculates the needed size of the buffer surfaces,
>>>     as suggested by Jason:
>>>
>>>     surface_size = 2 * align_u64(buffer_size, 4)  - buffer_size
>>
>>
>> I also changed the commit log according to the comments below.
>>
>>>
>>>     So when we calculate backwards the buffer_size in the backend we
>>>     update the resinfo return value with:
>>>
>>>     buffer_size = (surface_size & ~3) - (surface_size & 3)
>>>
>>>     These buffer requirements are also exposed when robust buffer
>>>     access is enabled, so these buffer sizes are recommended to be a
>>>     multiple of 4.
>>>
>>>     v2: (Jason Ekstrand)
>>>         Move padding logic from anv to isl_surface_state
>>>         Move calculation of original size from spirv to driver backend
>>>     ---
>>>      src/intel/compiler/brw_fs_nir.cpp | 27 ++-
>>>      src/intel/isl/isl_surface_state.c | 21 -
>>>      src/intel/vulkan/anv_device.c     | 11 +++
>>>      3 files changed, 57 insertions(+), 2 deletions(-)
>>>
>>>     diff --git a/src/intel/compiler/brw_fs_nir.cpp
>>>     b/src/intel/compiler/brw_fs_nir.cpp
>>>     index 8efec34cc9d..d017af040b4 100644
>>>     --- a/src/intel/compiler/brw_fs_nir.cpp
>>>     +++ b/src/intel/compiler/brw_fs_nir.cpp
>>>     @@ -4290,7 +4290,32 @@ fs_visitor::nir_emit_intrinsic(const
>>>     fs_builder , nir_intrinsic_instr *instr
>>>            inst->mlen = 1;
>>>            inst->size_written = 4 * REG_SIZE;
>>>
>>>     -      bld.MOV(retype(dest, ret_payload.type),
>>>     component(ret_payload, 0));
>>>     +      /* SKL PRM, vol07, 3D Media GPGPU Engine, Bounds Checking and
>>>     Faulting:
>>>     +       *
>>>     +       * "Out-of-bounds checking is always performed at a DWord
>>>     granularity. If
>>>     +       * any part of the DWord is out-of-bounds then the whole
>>> DWord is
>>>     +       * considered out-of-bounds."
>>>     +       *
>>>     +       * This implies that types with size smaller than 4-bytes
>>>     (16-bits) need
>>>
>>>
>>> Yeah, 4-bytes (16-bits) is weird.  I'd just drop the "(16-bits)".
>>
>> Completely agree.
>>
>>>     +       * to be padded if they don't complete the last dword of the
>>>     buffer. B

Re: [Mesa-dev] [PATCH 3/7] i965/fs: Support 16-bit store_ssbo with VK_KHR_relaxed_block_layout

2018-02-26 Thread Chema Casanova
On 23/02/18 21:23, Jason Ekstrand wrote:
> On Fri, Feb 23, 2018 at 1:26 AM, Jose Maria Casanova Crespo
> > wrote:
> 
> Restrict the use of untyped_surface_write with 16-bit pairs in
> ssbo to the cases where we can guarantee that offset is multiple
> of 4.
> 
> Taking into account that VK_KHR_relaxed_block_layout is available
> in ANV we can only guarantee that when we have a constant offset
> that is multiple of 4. For non constant offsets we will always use
> byte_scattered_write.
> 
> 
> I double-checked the rules and we can't even guarantee that a f16vec2 is
> dword-aligned.
>  
> 
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 22 +++---
>  1 file changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 45b8e8b637..abf9098252 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -4135,6 +4135,8 @@ fs_visitor::nir_emit_intrinsic(const
> fs_builder , nir_intrinsic_instr *instr
>           unsigned num_components = ffs(~(writemask >>
> first_component)) - 1;
>           fs_reg write_src = offset(val_reg, bld, first_component);
> 
> +         nir_const_value *const_offset =
> nir_src_as_const_value(instr->src[2]);
> +
>           if (type_size > 4) {
>              /* We can't write more than 2 64-bit components at
> once. Limit
>               * the num_components of the write to what we can do
> and let the next
> @@ -4150,14 +4152,19 @@ fs_visitor::nir_emit_intrinsic(const
> fs_builder , nir_intrinsic_instr *instr
>               * 32-bit-aligned we need to use byte-scattered writes
> because
>               * untyped writes works with 32-bit components with 32-bit
>               * alignment. byte_scattered_write messages only
> support one
> -             * 16-bit component at a time.
> +             * 16-bit component at a time. As
> VK_KHR_relaxed_block_layout
> +             * could be enabled we can not guarantee that not
> constant offsets
> +             * to be 32-bit aligned for 16-bit types. For example
> an array, of
> +             * 16-bit vec3 with array element stride of 6.
>               *
> -             * For example, if there is a 3-components vector we
> submit one
> -             * untyped-write message of 32-bit (first two
> components), and one
> -             * byte-scattered write message (the last component).
> +             * In the case of 32-bit aligned constant offsets if
> there is
> +             * a 3-components vector we submit one untyped-write
> message
> +             * of 32-bit (first two components), and one byte-scattered
> +             * write message (the last component).
>               */
> 
> -            if (first_component % 2) {
> +            if ( !const_offset || ((const_offset->u32[0] +
> +                                   type_size * first_component) % 4)) {
>                 /* If we use a .yz writemask we also need to emit 2
>                  * byte-scattered write messages because of
> y-component not
>                  * being aligned to 32-bit.
> @@ -4183,7 +4190,7 @@ fs_visitor::nir_emit_intrinsic(const
> fs_builder , nir_intrinsic_instr *instr
>           }
> 
>           fs_reg offset_reg;
> -         nir_const_value *const_offset =
> nir_src_as_const_value(instr->src[2]);
> +
>           if (const_offset) {
>              offset_reg = brw_imm_ud(const_offset->u32[0] +
>                                      type_size * first_component);
> @@ -4222,7 +4229,8 @@ fs_visitor::nir_emit_intrinsic(const
> fs_builder , nir_intrinsic_instr *instr
>           } else {
>              assert(num_components * type_size <= 16);
>              assert((num_components * type_size) % 4 == 0);
> -            assert((first_component * type_size) % 4 == 0);
> +            assert(!const_offset ||
> +                   (const_offset->u32[0] + type_size *
> first_component) % 4 == 0);
> 
> 
> How about
> 
> assert(offset_reg.file != BRW_IMMEDIATE_VALUE || offset_reg.ud % 4 == 0);
> 
> We've already done the above calculation and stored it in offset_reg.

Makes sense.
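
For reference, the selection rule discussed in this patch can be sketched
in plain C. This is only an illustration under stated assumptions: the
function name and its parameters are hypothetical and are not part of
brw_fs_nir.cpp; it just mirrors the condition from the hunk above for the
16-bit case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not Mesa code: returns true when a 16-bit store
 * has to fall back to byte-scattered writes.
 *
 * has_const_offset: whether the SSBO offset is a compile-time constant.
 * const_offset:     the constant byte offset (ignored when not constant).
 * type_size:        size of one component in bytes (2 for 16-bit types).
 * first_component:  index of the first component being written.
 */
static bool
needs_byte_scattered_write(bool has_const_offset, uint32_t const_offset,
                           unsigned type_size, unsigned first_component)
{
   /* This sketch only models the 16-bit path. */
   assert(type_size == 2);

   /* With relaxed block layout a non-constant offset gives no
    * 32-bit alignment guarantee, so always take the safe path. */
   if (!has_const_offset)
      return true;

   /* A constant offset can use the untyped write only when the first
    * written byte is dword-aligned. */
   return (const_offset + type_size * first_component) % 4 != 0;
}
```

With an array of 16-bit vec3 and an element stride of 6, the second
element starts at byte 6, so a write starting at its x component is not
dword-aligned, while a write starting at its y component (byte 8) is.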
> 
> Reviewed-by: Jason Ekstrand  >

Thanks for the review.

Chema

>              unsigned num_slots = (num_components * type_size) / 4;
> 
>              emit_untyped_write(bld, surf_index, offset_reg,
> --
> 2.14.3
> 
> ___
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org 
> 

Re: [Mesa-dev] [PATCH 1/7] isl/i965/fs: SSBO/UBO buffers need size padding if not multiple of 32-bit (v2)

2018-02-26 Thread Chema Casanova
On 26/02/18 16:54, Jason Ekstrand wrote:
> On Mon, Feb 26, 2018 at 6:14 AM, Jose Maria Casanova Crespo
> > wrote:
> 
> The surfaces that back the GPU buffers have a bounds check that
> treats any access to a partial dword as out-of-bounds.
> For example, buffers with 1 or 3 16-bit elements have size 2 or 6, and
> the last two bytes would always be read as 0 or their writes ignored.
> 
> The introduction of 16-bit types implies that we need to align the size
> to a multiple of 4 bytes so that partial dwords can be read/written.
> Adding an unconditional +2 to the size of buffers that are not a multiple
> of 2 solves this issue for the general cases of UBO or SSBO.
> 
> But, when unsized arrays of 16-bit elements are used it is not possible
> to know if the size was padded or not. To solve this issue the
> implementation calculates the needed size of the buffer surfaces,
> as suggested by Jason:
> 
> surface_size = 2 * align_u64(buffer_size, 4) - buffer_size


Also changed the commit log according to the comments below.

> 
> So when we calculate backwards the buffer_size in the backend we
> update the resinfo return value with:
> 
> buffer_size = (surface_size & ~3) - (surface_size & 3)
> 
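
As a quick sanity check of the two formulas above, here is a small C
sketch of the encode/decode round trip. It is only an illustration: the
function names are made up for this example, and align_u64 is a local
helper written here, which may differ from the definition used in Mesa.

```c
#include <assert.h>
#include <stdint.h>

/* Local helper for this example: round v up to a multiple of a,
 * where a must be a power of two. */
static uint64_t
align_u64(uint64_t v, uint64_t a)
{
   assert(a != 0 && (a & (a - 1)) == 0);
   return (v + a - 1) & ~(a - 1);
}

/* Size programmed into the surface state. The result is the aligned
 * size plus the number of padding bytes, so the padding (0..3) ends up
 * in the low two bits. */
static uint64_t
encode_surface_size(uint64_t buffer_size)
{
   return 2 * align_u64(buffer_size, 4) - buffer_size;
}

/* Recover the original buffer size from the value returned by resinfo.
 * The parentheses matter: in C, binary '-' binds tighter than '&'. */
static uint64_t
decode_buffer_size(uint64_t surface_size)
{
   return (surface_size & ~UINT64_C(3)) - (surface_size & 3);
}
```

For a buffer of 6 bytes the surface size becomes 2 * 8 - 6 = 10, and
decoding gives (10 & ~3) - (10 & 3) = 8 - 2 = 6. Sizes that are already a
multiple of 4 are left unchanged by both steps.
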
> These buffer requirements are also exposed when robust buffer access
> is enabled, so these buffer sizes are recommended to be a multiple of 4.
> 
> v2: (Jason Ekstrand)
>     Move padding logic from anv to isl_surface_state
>     Move calculation of original size from spirv to driver backend
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 27 ++-
>  src/intel/isl/isl_surface_state.c | 21 -
>  src/intel/vulkan/anv_device.c     | 11 +++
>  3 files changed, 57 insertions(+), 2 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 8efec34cc9d..d017af040b4 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -4290,7 +4290,32 @@ fs_visitor::nir_emit_intrinsic(const
> fs_builder , nir_intrinsic_instr *instr
>        inst->mlen = 1;
>        inst->size_written = 4 * REG_SIZE;
> 
> -      bld.MOV(retype(dest, ret_payload.type),
> component(ret_payload, 0));
> +      /* SKL PRM, vol07, 3D Media GPGPU Engine, Bounds Checking and
> Faulting:
> +       *
> +       * "Out-of-bounds checking is always performed at a DWord
> granularity. If
> +       * any part of the DWord is out-of-bounds then the whole DWord is
> +       * considered out-of-bounds."
> +       *
> +       * This implies that types with size smaller than 4-bytes
> (16-bits) need
> 
> 
> Yeah, 4-bytes (16-bits) is weird.  I'd just drop the "(16-bits)".

Completely agree.

> +       * to be padded if they don't complete the last dword of the
> buffer. But
> +       * as we need to maintain the original size we need to
> reverse the padding
> +       * calculation to return the correct size to know the  number
> of elements
> +       * of an unsized array. As we stored in the last two bits of
> the size
> +       * of the buffer the needed padding we calculate here:
> +       *
> +       * buffer_size = resinfo_size & ~3 - resinfo_size & 3
> +       */
> +
> +      fs_reg size_aligned32 = ubld.vgrf(BRW_REGISTER_TYPE_UD);
> 
> 
> I'd call this aligned4 because it's in units of bytes.

Locally changed.

> +      fs_reg size_padding = ubld.vgrf(BRW_REGISTER_TYPE_UD);
> +      fs_reg buffer_size = ubld.vgrf(BRW_REGISTER_TYPE_UD);
> +
> +      ubld.AND(size_padding, component(ret_payload, 0), brw_imm_ud(3));
> +      ubld.AND(size_aligned32, component(ret_payload, 0),
> brw_imm_ud(~3));
> 
> 
> You don't really need the component() here.

Removed.

> +      ubld.ADD(buffer_size, size_aligned32, negate (size_padding));

Removed space after negate

> +
> +      bld.MOV(retype(dest, ret_payload.type),
> component(buffer_size, 0));
> +
>        brw_mark_surface_used(prog_data, index);
>        break;
>     }
> diff --git a/src/intel/isl/isl_surface_state.c
> b/src/intel/isl/isl_surface_state.c
> index bfb27fa4a44..ddc9eb53c96 100644
> --- a/src/intel/isl/isl_surface_state.c
> +++ b/src/intel/isl/isl_surface_state.c
> @@ -673,7 +673,26 @@ void
>  isl_genX(buffer_fill_state_s)(void *state,
>                                const struct
> isl_buffer_fill_state_info *restrict info)
>  {
> -   uint32_t num_elements = info->size / info->stride;
> +   uint64_t buffer_size = info->size;
> +
> +   /* Uniform and Storage buffers need to have surface size not
> less that the
> +    * aligned 32-bit size of the buffer. To calculate the array
>

Re: [Mesa-dev] [PATCH 1/7] isl/i965/fs: SSBO/UBO buffers need size padding if not multiple of 32-bit (v2)

2018-02-26 Thread Chema Casanova
On 26/02/18 15:40, Ilia Mirkin wrote:
> On Mon, Feb 26, 2018 at 9:14 AM, Jose Maria Casanova Crespo
>  wrote:
>> The surfaces that back the GPU buffers have a bounds check that
>> treats any access to a partial dword as out-of-bounds.
>> For example, buffers with 1 or 3 16-bit elements have size 2 or 6, and
>> the last two bytes would always be read as 0 or their writes ignored.
>>
>> The introduction of 16-bit types implies that we need to align the size
>> to a multiple of 4 bytes so that partial dwords can be read/written.
>> Adding an unconditional +2 to the size of buffers that are not a multiple
>> of 2 solves this issue for the general cases of UBO or SSBO.
>>
>> But, when unsized arrays of 16-bit elements are used it is not possible
>> to know if the size was padded or not. To solve this issue the
>> implementation calculates the needed size of the buffer surfaces,
>> as suggested by Jason:
>>
>> surface_size = 2 * align_u64(buffer_size, 4) - buffer_size
>>
>> So when we calculate backwards the buffer_size in the backend we
>> update the resinfo return value with:
>>
>> buffer_size = (surface_size & ~3) - (surface_size & 3)
>>
>> These buffer requirements are also exposed when robust buffer access
>> is enabled, so these buffer sizes are recommended to be a multiple of 4.
>>
>> v2: (Jason Ekstrand)
>> Move padding logic from anv to isl_surface_state
>> Move calculation of original size from spirv to driver backend
>> ---
>>  src/intel/compiler/brw_fs_nir.cpp | 27 ++-
>>  src/intel/isl/isl_surface_state.c | 21 -
>>  src/intel/vulkan/anv_device.c | 11 +++
>>  3 files changed, 57 insertions(+), 2 deletions(-)
>>
>> diff --git a/src/intel/compiler/brw_fs_nir.cpp 
>> b/src/intel/compiler/brw_fs_nir.cpp
>> index 8efec34cc9d..d017af040b4 100644
>> --- a/src/intel/compiler/brw_fs_nir.cpp
>> +++ b/src/intel/compiler/brw_fs_nir.cpp
>> @@ -4290,7 +4290,32 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , 
>> nir_intrinsic_instr *instr
>>inst->mlen = 1;
>>inst->size_written = 4 * REG_SIZE;
>>
>> -  bld.MOV(retype(dest, ret_payload.type), component(ret_payload, 0));
>> +  /* SKL PRM, vol07, 3D Media GPGPU Engine, Bounds Checking and 
>> Faulting:
>> +   *
>> +   * "Out-of-bounds checking is always performed at a DWord 
>> granularity. If
>> +   * any part of the DWord is out-of-bounds then the whole DWord is
>> +   * considered out-of-bounds."
>> +   *
>> +   * This implies that types with size smaller than 4-bytes (16-bits) 
>> need
> 
> 32 bits?

The "(16-bits)" was meant as an example of a type with a size smaller than
4 bytes, so it is better to remove it.

Thanks

Chema

>> +   * to be padded if they don't complete the last dword of the buffer. 
>> But
>> +   * as we need to maintain the original size we need to reverse the 
>> padding
>> +   * calculation to return the correct size to know the  number of 
>> elements
>> +   * of an unsized array. As we stored in the last two bits of the size
>> +   * of the buffer the needed padding we calculate here:
>> +   *
>> +   * buffer_size = resinfo_size & ~3 - resinfo_size & 3
>> +   */
> ___
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> 


Re: [Mesa-dev] [PATCH 1/7] anv/spirv: SSBO/UBO buffers needs padding size is not multiple of 32-bits

2018-02-26 Thread Chema Casanova
El 23/02/18 a las 22:31, Jason Ekstrand escribió:
> On Fri, Feb 23, 2018 at 12:28 PM, Chema Casanova <jmcasan...@igalia.com
> <mailto:jmcasan...@igalia.com>> wrote:
> 
> 
> 
> El 23/02/18 a las 17:26, Jason Ekstrand escribió:
> > On Fri, Feb 23, 2018 at 1:26 AM, Jose Maria Casanova Crespo
> > <jmcasan...@igalia.com <mailto:jmcasan...@igalia.com>
> <mailto:jmcasan...@igalia.com <mailto:jmcasan...@igalia.com>>> wrote:
> >
> >     The surfaces that backup the GPU buffers have a boundary check
> that
> >     considers that access to partial dwords are considered
> out-of-bounds.
> >     For example is basic 16-bit cases of buffers with size 2 or 6
> where the
> >     last two bytes will always be read as 0 or its writting ignored.
> >
> >     The introduction of 16-bit types implies that we need to align
> the size
> >     to 4-bytes multiples so that partial dwords could be read/written.
> >     Adding an inconditional +2 size to buffers not being multiple of 2
> >     solves this issue for the general cases of UBO or SSBO.
> >
> >     But, when unsized_arrays of 16-bit elements are used it is not
> possible
> >     to know if the size was padded or not. To solve this issue the
> >     implementation of SSBO calculates the needed size of the surface,
> >     as suggested by Jason:
> >
> >     surface_size = 2 * aling_u64(buffer_size, 4)  - buffer_size
> >
> >     So when we calculate backwards the SpvOpArrayLenght with a nir
> expresion
> >     when the array stride is not multiple of 4.
> >
> >     array_size = (surface_size & ~3) - (surface_size & 3)
> >
> >     It is also exposed this buffer requirements when robust buffer
> access
> >     is enabled so these buffer sizes recommend being multiple of 4.
> >     ---
> >
> >     I have some doubts if vtn_variables.c is the best place to include
> >     this specific to calculate the real buffer size as this is new
> >     calculus seems to be quite HW dependent and maybe other drivers
> >     different
> >     to ANV don't need this kind of solution.
> >
> >      src/compiler/spirv/vtn_variables.c    | 14 ++
> >      src/intel/vulkan/anv_descriptor_set.c | 16 
> >      src/intel/vulkan/anv_device.c         | 11 +++
> >      3 files changed, 41 insertions(+)
> >
> >     diff --git a/src/compiler/spirv/vtn_variables.c
> >     b/src/compiler/spirv/vtn_variables.c
> >     index 9eb85c24e9..78adab3ed2 100644
> >     --- a/src/compiler/spirv/vtn_variables.c
> >     +++ b/src/compiler/spirv/vtn_variables.c
> >     @@ -2113,6 +2113,20 @@ vtn_handle_variables(struct vtn_builder *b,
> >     SpvOp opcode,
> >            nir_builder_instr_insert(>nb, >instr);
> >            nir_ssa_def *buf_size = >dest.ssa;
> >
> >     +      /* Calculate real length if padding was done to align
> the buffer
> >     +       * to 32-bits. This only could happen is stride is not
> multiple
> >     +       * of 4. Introduced to support 16-bit type unsized
> arrays in anv.
> >     +       */
> >     +      if (stride % 4) {
> >     +         buf_size = nir_isub(>nb,
> >     +                             nir_iand(>nb,
> >     +                                      buf_size,
> >     +                                      nir_imm_int(>nb, ~3)),
> >     +                             nir_iand (>nb,
> >     +                                       buf_size,
> >     +                                       nir_imm_int(>nb, 3)));
> >
> >
> > We can't do this in spirv_to_nir as it's also used by radv and
> they may
> > not have the same issue.  Instead, we need to handle it either in
> > anv_nir_apply_pipeline_layout or in the back-end compiler.  Doing it
> > here has the advantage that we can only do it in the "stride % 4 != 0"
> > case but I don't think the three instructions are all that big of
> a deal
> > given that we just did a send and are about to do an integer
> divide.  My
> > preference would be to put most of it in ISL and the back-end compiler
> > if we can.
> 
> I've already had my doubts in my commit comment. So I'll implement it
>

Re: [Mesa-dev] [PATCH 6/7] spirv/i965/anv: Relax push constant offset assertions being 32-bit aligned

2018-02-26 Thread Chema Casanova
El 23/02/18 a las 22:36, Jason Ekstrand escribió:
> Assuming the CTS is still happy with it after those changes,

CTS was happy, but piglit has complained a lot.

> 
> Reviewed-by: Jason Ekstrand <ja...@jlekstrand.net
> <mailto:ja...@jlekstrand.net>>
> 
> On Fri, Feb 23, 2018 at 1:16 PM, Chema Casanova <jmcasan...@igalia.com
> <mailto:jmcasan...@igalia.com>> wrote:
> 
> On 23/02/18 20:09, Jason Ekstrand wrote:
> > On Fri, Feb 23, 2018 at 1:26 AM, Jose Maria Casanova Crespo
> > <jmcasan...@igalia.com <mailto:jmcasan...@igalia.com>
> <mailto:jmcasan...@igalia.com <mailto:jmcasan...@igalia.com>>> wrote:
> >
> >     The introduction of 16-bit types with VK_KHR_16bit_storages
> implies that
> >     push constant offsets could be multiple of 2-bytes. Some
> assertions are
> >     relaxed so offsets can be multiple of 4-bytes or multiple of
> size of the
> >     base type.
> >
> >     For 16-bit types, the push constant offset takes into account the
> >     internal offset in the 32-bit uniform bucket adding 2-bytes
> when we
> >     access
> >     not 32-bit aligned elements. In all 32-bit aligned cases it just
> >     becomes 0.
> >     ---
> >      src/compiler/spirv/vtn_variables.c              |  1 -
> >      src/intel/compiler/brw_fs_nir.cpp               | 16
> +++-
> >      src/intel/vulkan/anv_nir_lower_push_constants.c |  2 --
> >      3 files changed, 11 insertions(+), 8 deletions(-)
> >
> >     diff --git a/src/compiler/spirv/vtn_variables.c
> >     b/src/compiler/spirv/vtn_variables.c
> >     index 81658afbd9..87236d0abd 100644
> >     --- a/src/compiler/spirv/vtn_variables.c
> >     +++ b/src/compiler/spirv/vtn_variables.c
> >     @@ -760,7 +760,6 @@ _vtn_load_store_tail(struct vtn_builder *b,
> >     nir_intrinsic_op op, bool load,
> >         }
> >
> >         if (op == nir_intrinsic_load_push_constant) {
> >     -      vtn_assert(access_offset % 4 == 0);
> >
> >            nir_intrinsic_set_base(instr, access_offset);
> >            nir_intrinsic_set_range(instr, access_size);
> >     diff --git a/src/intel/compiler/brw_fs_nir.cpp
> >     b/src/intel/compiler/brw_fs_nir.cpp
> >     index abf9098252..27611a21d0 100644
> >     --- a/src/intel/compiler/brw_fs_nir.cpp
> >     +++ b/src/intel/compiler/brw_fs_nir.cpp
> >     @@ -3887,16 +3887,22 @@ fs_visitor::nir_emit_intrinsic(const
> >     fs_builder , nir_intrinsic_instr *instr
> >            break;
> >
> >         case nir_intrinsic_load_uniform: {
> >     -      /* Offsets are in bytes but they should always be multiples
> >     of 4 */
> >     -      assert(instr->const_index[0] % 4 == 0);
> >     +      /* Offsets are in bytes but they should always be
> multiple of 4
> >     +       * or multiple of the size of the destination type. 2
> for 16-bits
> >     +       * types.
> >
> >     +       */
> >     +      assert(instr->const_index[0] % 4 == 0 ||
> >     +             instr->const_index[0] % type_sz(dest.type) == 0);
> >
> >
> > Doubles are required to be 8-byte aligned so we can just have the dest
> > type size check.
> 
> Changed locally.


It seems that we cannot guarantee that doubles are aligned to an
8-byte offset.

Without the "% 4 == 0" check several tests crash, for example:

Test:
piglit.spec.arb_gpu_shader_int64.execution.conversion.frag-conversion-explicit-uvec3-i64vec3

In this case we have an ivec3 uniform followed by an i64vec3 whose offset
is 12 when it reaches this assertion.
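
To make the failing case concrete, here is a small C sketch comparing the
two variants of the check. The function names are hypothetical and exist
only for this example; they are not taken from brw_fs_nir.cpp.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: the check with only the destination type size,
 * as suggested in the review. */
static bool
offset_ok_type_size_only(unsigned offset, unsigned type_size)
{
   assert(type_size != 0);
   return offset % type_size == 0;
}

/* Hypothetical: the relaxed check from the patch, which accepts
 * either 4-byte alignment or alignment to the type size. */
static bool
offset_ok_relaxed(unsigned offset, unsigned type_size)
{
   assert(type_size != 0);
   return offset % 4 == 0 || offset % type_size == 0;
}
```

An i64vec3 (8-byte components) placed at offset 12 fails the first check
because 12 % 8 != 0, but passes the second because 12 % 4 == 0. A 16-bit
value at offset 2 passes both.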


> >  
> >
> >            fs_reg src(UNIFORM, instr->const_index[0] / 4, dest.type);
> >
> >            nir_const_value *const_offset =
> >     nir_src_as_const_value(instr->src[0]);
> >            if (const_offset) {
> >     -         /* Offsets are in bytes but they should always be
> >     multiples of 4 */
> >     -         assert(const_offset->u32[0] % 4 == 0);
> >     -         src.offset = const_offset->u32[0];
> >     +         assert(const_offset->u32[0] % 4 == 0 ||
> >     +                const_offset->u32[0] % type_sz(dest.type) == 0);
> >
> >
> > Same here.
> 
> Changed locally.

This assertion change didn't raise any regression. I'm sending v

Re: [Mesa-dev] [PATCH 6/7] spirv/i965/anv: Relax push constant offset assertions being 32-bit aligned

2018-02-23 Thread Chema Casanova
On 23/02/18 20:09, Jason Ekstrand wrote:
> On Fri, Feb 23, 2018 at 1:26 AM, Jose Maria Casanova Crespo
> > wrote:
> 
> The introduction of 16-bit types with VK_KHR_16bit_storages implies that
> push constant offsets could be multiple of 2-bytes. Some assertions are
> relaxed so offsets can be multiple of 4-bytes or multiple of size of the
> base type.
> 
> For 16-bit types, the push constant offset takes into account the
> internal offset in the 32-bit uniform bucket adding 2-bytes when we
> access
> not 32-bit aligned elements. In all 32-bit aligned cases it just
> becomes 0.
> ---
>  src/compiler/spirv/vtn_variables.c              |  1 -
>  src/intel/compiler/brw_fs_nir.cpp               | 16 +++-
>  src/intel/vulkan/anv_nir_lower_push_constants.c |  2 --
>  3 files changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/src/compiler/spirv/vtn_variables.c
> b/src/compiler/spirv/vtn_variables.c
> index 81658afbd9..87236d0abd 100644
> --- a/src/compiler/spirv/vtn_variables.c
> +++ b/src/compiler/spirv/vtn_variables.c
> @@ -760,7 +760,6 @@ _vtn_load_store_tail(struct vtn_builder *b,
> nir_intrinsic_op op, bool load,
>     }
> 
>     if (op == nir_intrinsic_load_push_constant) {
> -      vtn_assert(access_offset % 4 == 0);
> 
>        nir_intrinsic_set_base(instr, access_offset);
>        nir_intrinsic_set_range(instr, access_size);
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index abf9098252..27611a21d0 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -3887,16 +3887,22 @@ fs_visitor::nir_emit_intrinsic(const
> fs_builder , nir_intrinsic_instr *instr
>        break;
> 
>     case nir_intrinsic_load_uniform: {
> -      /* Offsets are in bytes but they should always be multiples
> of 4 */
> -      assert(instr->const_index[0] % 4 == 0);
> +      /* Offsets are in bytes but they should always be multiple of 4
> +       * or multiple of the size of the destination type. 2 for 16-bits
> +       * types.
> 
> +       */
> +      assert(instr->const_index[0] % 4 == 0 ||
> +             instr->const_index[0] % type_sz(dest.type) == 0);
> 
> 
> Doubles are required to be 8-byte aligned so we can just have the dest
> type size check.

Changed locally.

>  
> 
>        fs_reg src(UNIFORM, instr->const_index[0] / 4, dest.type);
> 
>        nir_const_value *const_offset =
> nir_src_as_const_value(instr->src[0]);
>        if (const_offset) {
> -         /* Offsets are in bytes but they should always be
> multiples of 4 */
> -         assert(const_offset->u32[0] % 4 == 0);
> -         src.offset = const_offset->u32[0];
> +         assert(const_offset->u32[0] % 4 == 0 ||
> +                const_offset->u32[0] % type_sz(dest.type) == 0);
> 
> 
> Same here.

Changed locally.

> +         /* For 16-bit types we add the module of the const_index[0]
> +          * offset to access to not 32-bit aligned element */
> +         src.offset = const_offset->u32[0] + instr->const_index[0] % 4;
> 
>           for (unsigned j = 0; j < instr->num_components; j++) {
>              bld.MOV(offset(dest, bld, j), offset(src, bld, j));
> diff --git a/src/intel/vulkan/anv_nir_lower_push_constants.c
> b/src/intel/vulkan/anv_nir_lower_push_constants.c
> index b66552825b..ad60d0c824 100644
> --- a/src/intel/vulkan/anv_nir_lower_push_constants.c
> +++ b/src/intel/vulkan/anv_nir_lower_push_constants.c
> @@ -41,8 +41,6 @@ anv_nir_lower_push_constants(nir_shader *shader)
>              if (intrin->intrinsic != nir_intrinsic_load_push_constant)
>                 continue;
> 
> -            assert(intrin->const_index[0] % 4 == 0);
> -
>              /* We just turn them into uniform loads */
>              intrin->intrinsic = nir_intrinsic_load_uniform;
>           }
> --
> 2.14.3
> 
> 
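
As an illustration of the 16-bit offset handling in the load_uniform hunk
above, the arithmetic can be sketched in plain C. The struct and function
below are hypothetical and only model the two expressions from the patch
(const_index[0] / 4 for the uniform slot, and const_offset plus
const_index[0] % 4 for the byte offset); they are not Mesa code.

```c
#include <assert.h>

/* Hypothetical model of the uniform source: a 32-bit slot index plus
 * a byte offset relative to that slot. */
struct uniform_ref {
   unsigned slot;        /* index of the 32-bit uniform slot */
   unsigned byte_offset; /* byte offset applied on top of the slot */
};

static struct uniform_ref
uniform_ref_for(unsigned base, unsigned const_offset)
{
   struct uniform_ref ref;

   ref.slot = base / 4;
   /* The remainder of the base is carried into the byte offset so
    * that elements not aligned to 32 bits are still reachable. */
   ref.byte_offset = const_offset + base % 4;

   /* The pair must still address the same byte as base + const_offset. */
   assert(ref.slot * 4 + ref.byte_offset == base + const_offset);
   return ref;
}
```

For a 16-bit push constant with base 6 and a constant offset of 0 this
gives slot 1 with a byte offset of 2. For any base that is a multiple of
4 the extra term is 0, which matches the commit message.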


Re: [Mesa-dev] [PATCH 1/7] anv/spirv: SSBO/UBO buffers needs padding size is not multiple of 32-bits

2018-02-23 Thread Chema Casanova


El 23/02/18 a las 17:26, Jason Ekstrand escribió:
> On Fri, Feb 23, 2018 at 1:26 AM, Jose Maria Casanova Crespo
> > wrote:
> 
> The surfaces that back the GPU buffers have a bounds check that
> treats any access to a partial dword as out-of-bounds.
> Examples are the basic 16-bit cases of buffers with size 2 or 6, where the
> last two bytes will always be read as 0 or their writes ignored.
> 
> The introduction of 16-bit types implies that we need to align the size
> to a multiple of 4 bytes so that partial dwords can be read/written.
> Adding an unconditional +2 to the size of buffers that are not a multiple
> of 2 solves this issue for the general cases of UBO or SSBO.
> 
> But when unsized arrays of 16-bit elements are used it is not possible
> to know if the size was padded or not. To solve this issue the
> implementation of SSBO calculates the needed size of the surface,
> as suggested by Jason:
> 
> surface_size = 2 * align_u64(buffer_size, 4) - buffer_size
> 
> So we calculate the SpvOpArrayLength backwards with a nir expression
> when the array stride is not a multiple of 4:
> 
> array_size = (surface_size & ~3) - (surface_size & 3)
> 
> These buffer requirements are also exposed when robust buffer access
> is enabled, so these buffer sizes are recommended to be a multiple of 4.
> ---
> 
> I have some doubts if vtn_variables.c is the best place to include
> this specific to calculate the real buffer size as this is new
> calculus seems to be quite HW dependent and maybe other drivers
> different
> to ANV don't need this kind of solution.
> 
>  src/compiler/spirv/vtn_variables.c    | 14 ++
>  src/intel/vulkan/anv_descriptor_set.c | 16 
>  src/intel/vulkan/anv_device.c         | 11 +++
>  3 files changed, 41 insertions(+)
> 
> diff --git a/src/compiler/spirv/vtn_variables.c
> b/src/compiler/spirv/vtn_variables.c
> index 9eb85c24e9..78adab3ed2 100644
> --- a/src/compiler/spirv/vtn_variables.c
> +++ b/src/compiler/spirv/vtn_variables.c
> @@ -2113,6 +2113,20 @@ vtn_handle_variables(struct vtn_builder *b,
> SpvOp opcode,
>        nir_builder_instr_insert(>nb, >instr);
>        nir_ssa_def *buf_size = >dest.ssa;
> 
> +      /* Calculate real length if padding was done to align the buffer
> +       * to 32-bits. This only could happen is stride is not multiple
> +       * of 4. Introduced to support 16-bit type unsized arrays in anv.
> +       */
> +      if (stride % 4) {
> +         buf_size = nir_isub(>nb,
> +                             nir_iand(>nb,
> +                                      buf_size,
> +                                      nir_imm_int(>nb, ~3)),
> +                             nir_iand (>nb,
> +                                       buf_size,
> +                                       nir_imm_int(>nb, 3)));
> 
> 
> We can't do this in spirv_to_nir as it's also used by radv and they may
> not have the same issue.  Instead, we need to handle it either in
> anv_nir_apply_pipeline_layout or in the back-end compiler.  Doing it
> here has the advantage that we can only do it in the "stride % 4 != 0"
> case but I don't think the three instructions are all that big of a deal
> given that we just did a send and are about to do an integer divide.  My
> preference would be to put most of it in ISL and the back-end compiler
> if we can.

I already expressed my doubts in the commit comment, so I'll implement it
properly in the backend implementation of nir_intrinsic_get_buffer_size.
I need to have a look at that code first.

> 
> +      }
> +
>        /* array_length = max(buffer_size - offset, 0) / stride */
>        nir_ssa_def *array_length =
>           nir_idiv(>nb,
> diff --git a/src/intel/vulkan/anv_descriptor_set.c
> b/src/intel/vulkan/anv_descriptor_set.c
> index edb829601e..a97f2f37dc 100644
> --- a/src/intel/vulkan/anv_descriptor_set.c
> +++ b/src/intel/vulkan/anv_descriptor_set.c
> @@ -704,6 +704,22 @@ anv_descriptor_set_write_buffer(struct
> anv_descriptor_set *set,
>        bview->offset = buffer->offset + offset;
>        bview->range = anv_buffer_get_range(buffer, offset, range);
> 
> +      /* Uniform and Storage buffers need to have surface size
> +       * not less that the aligned 32-bit size of the buffer.
> +       * To calculate the array lenght on unsized arrays
> +       * in StorageBuffer the last 2 bits store the padding size
> +       * added to the surface, so we can calculate latter the original
> +       * buffer size to know the number of elements.
> +       *
> +       *  surface_size = 2 * aling_u64(buffer_size, 4)  - buffer_size
> +       *
> +       *  array_size = (surface_size & ~3) - 

Re: [Mesa-dev] [PATCH v4 00/44] anv: SPV_KHR_16bit_storage/VK_KHR_16bit_storage for gen8+

2017-12-11 Thread Chema Casanova
El 09/12/17 a las 00:52, Jason Ekstrand escribió:
> While reviewing some of the UBO pushing comments from Topi, I
> discovered a fairly disturbing assert in brw_fs_nir.cpp in our
> implementation of nir_intrinsic_load_uniform:
>
>  /* Offsets are in bytes but they should always be multiples
> of 4 */
>  assert(const_offset->u32[0] % 4 == 0);
>
> This assertion isn't triggering with 16bit storage enabled for push
> constants.  Looking at the CTS tests in a bit more detail, they're
> very poor.  They only test basic types (scalars, vectors, and
> matrices) and only in arrays with a dynamic index.  This means that
> the constant optimization paths for UBO pulls aren't getting triggered
> at all.  Also, we're not using push constants with any offsets not
> aligned to 4 (as per the above assert) so there's no real assurance
> that that works.  Given that constant offsets are a very common case
> for apps, this is very disappointing.  For the moment, I'm going to
> push a patch to master to disable 16bit storage.  I'm really sorry
> about that.  I think your code is great and, based on my review, I'm
> pretty sure it should work but I don't think we can really ship this
> extension in good faith when we know that there is a massive gaping
> hole in test coverage like this.  (The coverage hole is not your
> fault!)  I've also filed a bug (893) against the CTS.

I agree that it is better to increase test coverage before enabling the
feature, given your findings. At the same time, complete support for
VK_KHR_16bit_storage is still pending review of part of the input/output
support, so we are in no hurry to have it enabled.

Chema

>
> On Wed, Dec 6, 2017 at 12:09 AM, Alejandro Piñeiro
> <apinhe...@igalia.com <mailto:apinhe...@igalia.com>> wrote:
>
> On 06/12/17 01:19, Chema Casanova wrote:
> > On 05/12/17 18:31, Chema Casanova wrote:
> >> El 05/12/17 a las 06:16, Jason Ekstrand escribió:
> >>> A couple of notes:
> >>>
> >>>  1) I *think* I gave you enough reviews to land the UBO/SSBO
> part and
> >>> the optimizations in 26-28.  If reviews are still missing
> anywhere,
> >>> please let me know.  If not, let's try and get that part landed.
> >> The series is almost ready to land, I have only pending to
> address your
> >> feedback about use untyped_read for reading vec3 ssbos.
> >>
> >> The only missing explicit R-b is that " [PATCH v4 28/44]
> i965/fs: Use
> >> untyped_surface_read for 16-bit load_ssbo" and "[PATCH v4 23/44]
> >> i965/fs: Enables 16-bit load_ubo with sampler" i've just
> answered your
> >> review to confirm the R-b.
> >>
> >> I expect to finish today vec3 ssbo and send the series to
> Jenkins before
> >> landing, confirm your "pending" R-b, do a last rebase to master
> and ask
> >> for a push.
> > I've just prepared a rebased branch with the reviewed commits
> ready to
> > land to enable VK_KHR_16bit_storage support for SSBO/UBO.
> >
> >
> 
> https://github.com/Igalia/mesa/tree/wip/VK_KHR_16bit_storage-v4-ubo-ssbo-to-land
> >
> > As I don't have still commit access to mesa, maybe Eduardo or
> Alejandro
> > can land it for me tomorrow. But, Jason feel free to push it if
> you want.
>
> I have just pushed it to master.
>
> >
> > Chema
> >
> >>>  2) I sent out a patch to rewrite assign_constant_locations
> which I
> >>> think should make it automatically handle 8 and 16-bit values as
> >>> well.  I'd rather do that than more special casing if
> everything works
> >>> out ok.
> >> I'm testing this patch with 16-bits and make sure whatever is
> needed to
> >> have 16-bit working.
> >>
> >>>  3) I sent out a series of patches to enable pushing of UBOs in
> >>> Vulkan.  If we're not careful, these will clash with 16bit
> storage as
> >>> UBO support suddenly has to imply push constant support.  That
> said,
> >>> I'm willing to wait at least a little while before landing
> them to let
> >>> us get 16bit push constant support sorted out.  The UBO pushing
> >>> patches give us a nice little performance boost but we're
> nowhere nea

Re: [Mesa-dev] [PATCH 1/2] i965/fs: Rewrite assign_constant_locations

2017-12-06 Thread Chema Casanova
I've tested this patch against the VK-CTS push constant 16-bit tests
with storagePushConstant16 enabled in VK_KHR_16bit_storage. All tests
pass without any extra modification:

dEQP-VK.spirv_assembly.instruction.compute.16bit_storage.push_constant.*
dEQP-VK.spirv_assembly.instruction.graphics.16bit_storage.push_constant.*

I still have a pending TODO item: building a test that mixes push
constant values with different bit sizes.

Tested-by: Jose Maria Casanova Crespo 


On 04/12/17 02:50, Jason Ekstrand wrote:
> This rewires the logic for assigning uniform locations to work in terms
> of "complex alignments".  The basic idea is that, as we walk the list of
> instructions, we keep track of the alignment and continuity requirements
> of each slot and assert that the alignments all match up.  We then use
> those alignments in the compaction stage to ensure that everything gets
> placed at a properly aligned register.  The old mechanism handled
> alignments by special-casing each of the bit sizes and placing 64-bit
> values first followed by 32-bit values.
> 
> The old scheme had the advantage of never leaving a hole since all the
> 64-bit values could be tightly packed and so could the 32-bit values.
> However, the new scheme has no type size special cases so it handles not
> only 32 and 64-bit types but should gracefully extend to 16 and 8-bit
> types as the need arises.
> 
> Cc: Kenneth Graunke 
> Cc: Jose Maria Casanova Crespo 
> ---
>  src/intel/compiler/brw_fs.cpp | 307 
> --
>  1 file changed, 174 insertions(+), 133 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
> index 6772c0d..ffd8e12 100644
> --- a/src/intel/compiler/brw_fs.cpp
> +++ b/src/intel/compiler/brw_fs.cpp
> @@ -1874,62 +1874,6 @@ fs_visitor::compact_virtual_grfs()
> return progress;
>  }
>  
> -static void
> -set_push_pull_constant_loc(unsigned uniform, int *chunk_start,
> -   unsigned *max_chunk_bitsize,
> -   bool contiguous, unsigned bitsize,
> -   const unsigned target_bitsize,
> -   int *push_constant_loc, int *pull_constant_loc,
> -   unsigned *num_push_constants,
> -   unsigned *num_pull_constants,
> -   const unsigned max_push_components,
> -   const unsigned max_chunk_size,
> -   bool allow_pull_constants,
> -   struct brw_stage_prog_data *stage_prog_data)
> -{
> -   /* This is the first live uniform in the chunk */
> -   if (*chunk_start < 0)
> -  *chunk_start = uniform;
> -
> -   /* Keep track of the maximum bit size access in contiguous uniforms */
> -   *max_chunk_bitsize = MAX2(*max_chunk_bitsize, bitsize);
> -
> -   /* If this element does not need to be contiguous with the next, we
> -* split at this point and everything between chunk_start and u forms a
> -* single chunk.
> -*/
> -   if (!contiguous) {
> -  /* If bitsize doesn't match the target one, skip it */
> -  if (*max_chunk_bitsize != target_bitsize) {
> - /* FIXME: right now we only support 32 and 64-bit accesses */
> - assert(*max_chunk_bitsize == 4 || *max_chunk_bitsize == 8);
> - *max_chunk_bitsize = 0;
> - *chunk_start = -1;
> - return;
> -  }
> -
> -  unsigned chunk_size = uniform - *chunk_start + 1;
> -
> -  /* Decide whether we should push or pull this parameter.  In the
> -   * Vulkan driver, push constants are explicitly exposed via the API
> -   * so we push everything.  In GL, we only push small arrays.
> -   */
> -  if (!allow_pull_constants ||
> -  (*num_push_constants + chunk_size <= max_push_components &&
> -   chunk_size <= max_chunk_size)) {
> - assert(*num_push_constants + chunk_size <= max_push_components);
> - for (unsigned j = *chunk_start; j <= uniform; j++)
> -push_constant_loc[j] = (*num_push_constants)++;
> -  } else {
> - for (unsigned j = *chunk_start; j <= uniform; j++)
> -pull_constant_loc[j] = (*num_pull_constants)++;
> -  }
> -
> -  *max_chunk_bitsize = 0;
> -  *chunk_start = -1;
> -   }
> -}
> -
>  static int
>  get_subgroup_id_param_index(const brw_stage_prog_data *prog_data)
>  {
> @@ -1945,6 +1889,98 @@ get_subgroup_id_param_index(const brw_stage_prog_data 
> *prog_data)
>  }
>  
>  /**
> + * Struct for handling complex alignments.
> + *
> + * A complex alignment is stored as multiplier and an offset.  A value is
> + * considered to be aligned if it is congruent to the offset modulo the
> + * multiplier.
> + */
> +struct cplx_align {
> +   unsigned mul:4;
> +   unsigned offset:4;
> +};
> +
> +#define CPLX_ALIGN_MAX_MUL 8
> +
> +static void
> 
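To make the "complex alignment" idea from the quoted commit message concrete, here is a minimal standalone sketch. It only illustrates the rule stated in the comment above (a value is aligned if it is congruent to the offset modulo the multiplier). The helper names cplx_align_is_aligned and cplx_align_next are hypothetical and are not taken from the patch, and the struct drops the 4-bit bitfields for simplicity.

```cpp
#include <cassert>

// Simplified mirror of the struct in the patch (plain unsigned fields
// instead of 4-bit bitfields). Assumes mul > 0 and offset < mul.
struct cplx_align {
   unsigned mul;     // multiplier
   unsigned offset;  // required remainder modulo mul
};

// Hypothetical helper: true if `value` is congruent to align.offset
// modulo align.mul.
static bool
cplx_align_is_aligned(cplx_align align, unsigned value)
{
   return value % align.mul == align.offset;
}

// Hypothetical helper: smallest aligned value that is >= `value`.
static unsigned
cplx_align_next(cplx_align align, unsigned value)
{
   unsigned rem = value % align.mul;
   unsigned delta = (align.offset + align.mul - rem) % align.mul;
   return value + delta;
}
```

With mul = 8 and offset = 4, the aligned values are 4, 12, 20 and so on, so a compaction pass built on this idea can place each uniform at the next slot that satisfies its requirement without special-casing any particular bit size.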

Re: [Mesa-dev] [PATCH v4 00/44] anv: SPV_KHR_16bit_storage/VK_KHR_16bit_storage for gen8+

2017-12-05 Thread Chema Casanova
On 05/12/17 18:31, Chema Casanova wrote:
> El 05/12/17 a las 06:16, Jason Ekstrand escribió:
>> A couple of notes:
>>
>>  1) I *think* I gave you enough reviews to land the UBO/SSBO part and
>> the optimizations in 26-28.  If reviews are still missing anywhere,
>> please let me know.  If not, let's try and get that part landed.
> 
> The series is almost ready to land, I have only pending to address your
> feedback about use untyped_read for reading vec3 ssbos.
> 
> The only missing explicit R-b is that " [PATCH v4 28/44] i965/fs: Use
> untyped_surface_read for 16-bit load_ssbo" and "[PATCH v4 23/44]
> i965/fs: Enables 16-bit load_ubo with sampler" i've just answered your
> review to confirm the R-b.
> 
> I expect to finish today vec3 ssbo and send the series to Jenkins before
> landing, confirm your "pending" R-b, do a last rebase to master and ask
> for a push.

I've just prepared a rebased branch with the reviewed commits ready to
land to enable VK_KHR_16bit_storage support for SSBO/UBO.

https://github.com/Igalia/mesa/tree/wip/VK_KHR_16bit_storage-v4-ubo-ssbo-to-land

As I still don't have commit access to mesa, maybe Eduardo or Alejandro
can land it for me tomorrow. But Jason, feel free to push it if you want.

Chema

>>  2) I sent out a patch to rewrite assign_constant_locations which I
>> think should make it automatically handle 8 and 16-bit values as
>> well.  I'd rather do that than more special casing if everything works
>> out ok.
> 
> I'm testing this patch with 16-bits and make sure whatever is needed to
> have 16-bit working.
> 
>>
>>  3) I sent out a series of patches to enable pushing of UBOs in
>> Vulkan.  If we're not careful, these will clash with 16bit storage as
>> UBO support suddenly has to imply push constant support.  That said,
>> I'm willing to wait at least a little while before landing them to let
>> us get 16bit push constant support sorted out.  The UBO pushing
>> patches give us a nice little performance boost but we're nowhere near
>> a release and I don't want it blocking you.
> 
> That would be my next priority, so we would only have pending to land
> the 16-bit input/output support to finish this extension.
> 
> Chema
> 
>> On Wed, Nov 29, 2017 at 6:07 PM, Jose Maria Casanova Crespo
>> <jmcasan...@igalia.com> wrote:
>>
>> Hello,
>>
>> this is the V4 series for the implementation of the
>> SPV_KHR_16bit_storage
>> and VK_KHR_16bit_storage extensions on the anv vulkan driver, in
>> addition
>> to the GLSL and NIR support needed.
>>
>> The original series can be found here [1], the following v2 [2]
>> and v3 [3].
>>
>> In short V4 includes the following:
>>
>>  * Reorder the series to enable features as they are implemented,
>> the series
>>    now enables first UBO and SSBO support, and then inputs/outputs and
>>    finally push constants.
>>  * Support the byte scattered read/write messages with different
>> bit sizes
>>    byte/word/dword.
>>  * Refactor of the store_ssbo code and also fix stores when
>> writemask was .yz
>>  * Uses the sampler for load_ubo avoiding the initial
>> implementation of
>>    the series using byte_scattered_read.
>>  * Addressed all the feedback provided by Jason and Topi on v3 review.
>>
>> This series is also available at:
>>
>> https://github.com/Igalia/mesa/tree/wip/VK_KHR_16bit_storage-rc4
>>
>> The objective is to start landing part of this series, all
>> feedback has been
>> addressed for SSBO and UBO. But for input/outputs features it will
>> probably
>> need another iteration as was not completely reviewed. It is also
>> needed
>> to define the approach for push constants issues before of after
>> landing
>> the support with this implementation.
>>
>> Patches 1-5 and 8-17 have already been reviewed. Patch 7 was already
>> reviewed but as it has changed too much i would appreciate another
>> review. When patches until 25 or 28 are reviewed we could land
>> UBOs and
>> SSBOs support.
>>
>> Finally an updated overview of the patches:
>>
>> Patches 1-2 add 16-bit float, int and uint types to GLSL. This is
>> needed because NIR uses GLSL types internally. We use the enums
>> already defined at AMD_gpu_shader_half_float a

Re: [Mesa-dev] [PATCH v4 23/44] i965/fs: Enables 16-bit load_ubo with sampler

2017-12-05 Thread Chema Casanova


On 05/12/17 22:25, Chema Casanova wrote:
> On 05/12/17 19:53, Jason Ekstrand wrote:
>> On Tue, Dec 5, 2017 at 9:08 AM, Chema Casanova <jmcasan...@igalia.com> wrote:
>>
>> El 30/11/17 a las 23:58, Jason Ekstrand escribió:
>> > On Wed, Nov 29, 2017 at 6:50 PM, Jose Maria Casanova Crespo
>> > <jmcasan...@igalia.com> wrote:
>> >
>> >     load_ubo is using 32-bit loads as uniforms surfaces have a 32-bit
>> >     surface format defined. So when reading 16-bit components with the
>> >     sampler we need to unshuffle two 16-bit components from each
>> 32-bit
>> >     component.
>> >
>> >     Using the sampler avoids the use of the byte_scattered_read
>> message
>> >     that needs one message for each component and is supposed to be
>> >     slower.
>> >
>> >     In the case of SKL+ we take advantage of a hardware feature that
>> >     automatically defines a channel mask based on the rlen value,
>> so on
>> >     SKL+ we only use half of the registers without using a header
>> in the
>> >     payload.
>> >     ---
>> >      src/intel/compiler/brw_fs.cpp           | 31
>> >     +++
>> >      src/intel/compiler/brw_fs_generator.cpp | 10 --
>> >      2 files changed, 35 insertions(+), 6 deletions(-)
>> >
>> >     diff --git a/src/intel/compiler/brw_fs.cpp
>> >     b/src/intel/compiler/brw_fs.cpp
>> >     index 1ca4d416b2..9c543496ba 100644
>> >     --- a/src/intel/compiler/brw_fs.cpp
>> >     +++ b/src/intel/compiler/brw_fs.cpp
>> >     @@ -184,9 +184,17 @@ fs_visitor::VARYING_PULL_CONSTANT_LOAD(const
>> >     fs_builder ,
>> >          * a double this means we are only loading 2 elements worth of
>> >     data.
>> >          * We also want to use a 32-bit data type for the dst of the
>> >     load operation
>> >          * so other parts of the driver don't get confused about the
>> >     size of the
>> >     -    * result.
>> >     +    * result. On the case of 16-bit data we only need half of the
>> >     32-bit
>> >     +    * components on SKL+ as we take advance of using message
>> >     return size to
>> >     +    * define an xy channel mask.
>> >          */
>> >     -   fs_reg vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 4);
>> >     +   fs_reg vec4_result;
>> >     +   if (type_sz(dst.type) == 2 && (devinfo->gen >= 9)) {
>> >     +      vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 2);
>> >     +      vec4_result = retype(vec4_result, BRW_REGISTER_TYPE_HF);
>> >     +   } else {
>> >     +      vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 4);
>> >     +   }
>> >
>> >         fs_inst *inst =
>> >     bld.emit(FS_OPCODE_VARYING_PULL_CONSTANT_LOAD_LOGICAL,
>> >                                  vec4_result, surf_index,
>> vec4_offset);
>> >         inst->size_written = 4 *
>> >     vec4_result.component_size(inst->exec_size);
>> >     @@ -197,8 +205,23 @@ fs_visitor::VARYING_PULL_CONSTANT_LOAD(const
>> >     fs_builder ,
>> >         }
>> >
>> >         vec4_result.type = dst.type;
>> >     -   bld.MOV(dst, offset(vec4_result, bld,
>> >     -                       (const_offset & 0xf) /
>> >     type_sz(vec4_result.type)));
>> >     +
>> >     +   if (type_sz(dst.type) == 2) {
>> >     +      /* 16-bit types need to be unshuffled as each pair of
>> >     16-bit components
>> >     +       * is packed on a 32-bit compoment because we are using a
>> >     32-bit format
>> >     +       * in the surface of uniform that is read by the sampler.
>> >     +       * TODO: On BDW+ mark when an uniform has 16-bit type so we
>> >     could setup a
>> >     +       * surface format of 16-bit and use the 16-bit return
>> >     format at the
>> >     +       * sampler.
>> >     +       */
>> >     +      

Re: [Mesa-dev] [PATCH v4 28/44] i965/fs: Use untyped_surface_read for 16-bit load_ssbo (v2)

2017-12-05 Thread Chema Casanova
On 05/12/17 23:47, Jason Ekstrand wrote:
> On Tue, Dec 5, 2017 at 1:36 PM, Jose Maria Casanova Crespo
> <jmcasan...@igalia.com> wrote:
> 
> SSBO loads were using byte_scattered read messages as they allow
> reading 16-bit size components. byte_scattered messages can only
> operate one component at a time so we needed to emit as many messages
> as components.
> 
> But for vec2 and vec4 of 16-bit, being multiple of 32-bit we can use the
> untyped_surface_read message to read pairs of 16-bit components using
> only one message. Once each pair is read it is unshuffled to return the
> proper 16-bit components. vec3 case is assimilated to vec4 but the 4th
> component is ignored.
> 
> 16-bit scalars are read using one byte_scattered_read message.
> 
> v2: Removed use of stride = 2 on sources (Jason Ekstrand)
>         Rework optimization using unshuffle 16 reads (Chema Casanova)
> v3: Use W and D types instead of HF and F in shuffle to avoid rounding
>     errors (Jason Ekstrand)
>     Use untyped_surface_read for 16-bit vec3. (Jason Ekstrand)
> 
> CC: Jason Ekstrand <ja...@jlekstrand.net>
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 29 ++---
>  1 file changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index e11e75e6332..8deec082d59 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -2303,16 +2303,31 @@ do_untyped_vector_read(const fs_builder ,
>                         unsigned num_components)
>  {
>     if (type_sz(dest.type) <= 2) {
> -      fs_reg read_offset = bld.vgrf(BRW_REGISTER_TYPE_UD);
> -      bld.MOV(read_offset, offset_reg);
> -      for (unsigned i = 0; i < num_components; i++) {
> -         fs_reg read_reg =
> -            emit_byte_scattered_read(bld, surf_index, read_offset,
> +      assert(dest.stride == 1);
> +
> +      if (num_components > 1) {
> +         /* Pairs of 16-bit components can be read with untyped
> read, for 16-bit
> +          * vec3 4th component is ignored.
> +          */
> +         fs_reg read_result =
> +            emit_untyped_read(bld, surf_index, offset_reg,
> +                              1 /* dims */,
> DIV_ROUND_UP(num_components, 2),
> +                              BRW_PREDICATE_NONE);
> +         shuffle_32bit_load_result_to_16bit_data(bld,
> +               retype(dest, BRW_REGISTER_TYPE_W),
> +               retype(read_result, BRW_REGISTER_TYPE_D),
> +               num_components);
> +      } else {
> +         assert(num_components == 1);
> +         /* scalar 16-bit are read using one byte_scattered_read
> message */
> +         fs_reg read_result =
> +            emit_byte_scattered_read(bld, surf_index, offset_reg,
>                                       1 /* dims */, 1,
>                                       type_sz(dest.type) * 8 /*
> bit_size */,
>                                       BRW_PREDICATE_NONE);
> -         bld.MOV(offset(dest, bld, i), subscript(read_reg,
> dest.type, 0));
> -         bld.ADD(read_offset, read_offset,
> brw_imm_ud(type_sz(dest.type)));
> +         read_result.type = dest.type;
> +         read_result.stride = 2;
> +         bld.MOV(dest, read_result);
> 
> 
> If read_reg has a 32-bit type, you could use subscript here.  Meh.

Fixed locally.

> Reviewed-by: Jason Ekstrand <ja...@jlekstrand.net>


Thanks for the reviews. This was the last pending review to address
before being ready to land this part of the series.

I'm waiting to confirm with Jenkins that I fixed a regression it found
in the 16-bit load_ubo implementation with the sampler.


>  
> 
>        }
>     } else if (type_sz(dest.type) == 4) {
>        fs_reg read_result = emit_untyped_read(bld, surf_index,
> offset_reg,
> --
> 2.11.0
> 
> 
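For readers following the commit message, this is a small CPU-side sketch of the idea: instead of one byte-scattered read per 16-bit component, read DIV_ROUND_UP(num_components, 2) dwords and split each one into two 16-bit values. It is not the driver code. The function names are hypothetical, it works on a single channel rather than on SIMD registers, and it assumes that the lower-addressed component sits in the low 16 bits of each dword.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Number of 32-bit reads needed for `num_components` 16-bit components.
// A vec3 rounds up to 2 dwords, and the 4th half is simply ignored.
static unsigned
dwords_for_16bit_components(unsigned num_components)
{
   return (num_components + 1) / 2;
}

// Hypothetical single-channel "unshuffle": split each dword into its low
// and high halves and keep only the first `num_components` values.
// Assumes dwords.size() >= dwords_for_16bit_components(num_components).
static std::vector<uint16_t>
unshuffle_16bit(const std::vector<uint32_t> &dwords, unsigned num_components)
{
   std::vector<uint16_t> out;
   for (unsigned i = 0; i < num_components; i++) {
      uint32_t dw = dwords[i / 2];
      out.push_back(static_cast<uint16_t>(dw >> (16 * (i % 2))));
   }
   return out;
}
```

In this model a 16-bit vec3 costs two dword reads in a single message instead of three separate per-component messages, which is the saving the commit message describes.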


Re: [Mesa-dev] [PATCH v4 23/44] i965/fs: Enables 16-bit load_ubo with sampler

2017-12-05 Thread Chema Casanova
On 05/12/17 19:53, Jason Ekstrand wrote:
> On Tue, Dec 5, 2017 at 9:08 AM, Chema Casanova <jmcasan...@igalia.com> wrote:
> 
> El 30/11/17 a las 23:58, Jason Ekstrand escribió:
> > On Wed, Nov 29, 2017 at 6:50 PM, Jose Maria Casanova Crespo
> > <jmcasan...@igalia.com> wrote:
> >
> >     load_ubo is using 32-bit loads as uniforms surfaces have a 32-bit
> >     surface format defined. So when reading 16-bit components with the
> >     sampler we need to unshuffle two 16-bit components from each
> 32-bit
> >     component.
> >
> >     Using the sampler avoids the use of the byte_scattered_read
> message
> >     that needs one message for each component and is supposed to be
> >     slower.
> >
> >     In the case of SKL+ we take advantage of a hardware feature that
> >     automatically defines a channel mask based on the rlen value,
> so on
> >     SKL+ we only use half of the registers without using a header
> in the
> >     payload.
> >     ---
> >      src/intel/compiler/brw_fs.cpp           | 31
> >     +++
> >      src/intel/compiler/brw_fs_generator.cpp | 10 --
> >      2 files changed, 35 insertions(+), 6 deletions(-)
> >
> >     diff --git a/src/intel/compiler/brw_fs.cpp
> >     b/src/intel/compiler/brw_fs.cpp
> >     index 1ca4d416b2..9c543496ba 100644
> >     --- a/src/intel/compiler/brw_fs.cpp
> >     +++ b/src/intel/compiler/brw_fs.cpp
> >     @@ -184,9 +184,17 @@ fs_visitor::VARYING_PULL_CONSTANT_LOAD(const
> >     fs_builder ,
> >          * a double this means we are only loading 2 elements worth of
> >     data.
> >          * We also want to use a 32-bit data type for the dst of the
> >     load operation
> >          * so other parts of the driver don't get confused about the
> >     size of the
> >     -    * result.
> >     +    * result. On the case of 16-bit data we only need half of the
> >     32-bit
> >     +    * components on SKL+ as we take advance of using message
> >     return size to
> >     +    * define an xy channel mask.
> >          */
> >     -   fs_reg vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 4);
> >     +   fs_reg vec4_result;
> >     +   if (type_sz(dst.type) == 2 && (devinfo->gen >= 9)) {
> >     +      vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 2);
> >     +      vec4_result = retype(vec4_result, BRW_REGISTER_TYPE_HF);
> >     +   } else {
> >     +      vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 4);
> >     +   }
> >
> >         fs_inst *inst =
> >     bld.emit(FS_OPCODE_VARYING_PULL_CONSTANT_LOAD_LOGICAL,
> >                                  vec4_result, surf_index,
> vec4_offset);
> >         inst->size_written = 4 *
> >     vec4_result.component_size(inst->exec_size);
> >     @@ -197,8 +205,23 @@ fs_visitor::VARYING_PULL_CONSTANT_LOAD(const
> >     fs_builder ,
> >         }
> >
> >         vec4_result.type = dst.type;
> >     -   bld.MOV(dst, offset(vec4_result, bld,
> >     -                       (const_offset & 0xf) /
> >     type_sz(vec4_result.type)));
> >     +
> >     +   if (type_sz(dst.type) == 2) {
> >     +      /* 16-bit types need to be unshuffled as each pair of
> >     16-bit components
> >     +       * is packed on a 32-bit compoment because we are using a
> >     32-bit format
> >     +       * in the surface of uniform that is read by the sampler.
> >     +       * TODO: On BDW+ mark when an uniform has 16-bit type so we
> >     could setup a
> >     +       * surface format of 16-bit and use the 16-bit return
> >     format at the
> >     +       * sampler.
> >     +       */
> >     +      vec4_result.stride = 2;
> >     +      bld.MOV(dst, byte_offset(offset(vec4_result, bld,
> >     +                                      (const_offset & 0x7) / 4),
> >     +                               (const_offset & 0x7) / 2 % 2 *
> 2));
> >     +   } else {
> >     +      bld.MOV(dst, offset(vec4_result, bld,
> >     +           

Re: [Mesa-dev] [PATCH v4 00/44] anv: SPV_KHR_16bit_storage/VK_KHR_16bit_storage for gen8+

2017-12-05 Thread Chema Casanova
El 05/12/17 a las 06:16, Jason Ekstrand escribió:
> A couple of notes:
>
>  1) I *think* I gave you enough reviews to land the UBO/SSBO part and
> the optimizations in 26-28.  If reviews are still missing anywhere,
> please let me know.  If not, let's try and get that part landed.

The series is almost ready to land; the only thing pending is to address
your feedback about using untyped_read for reading vec3 SSBOs.

The only missing explicit R-bs are for "[PATCH v4 28/44] i965/fs: Use
untyped_surface_read for 16-bit load_ssbo" and "[PATCH v4 23/44]
i965/fs: Enables 16-bit load_ubo with sampler". I've just answered your
reviews to confirm the R-b.

Today I expect to finish the vec3 SSBO change and send the series to
Jenkins before landing, then confirm your "pending" R-b, do a last rebase
onto master and ask for a push.

>
>  2) I sent out a patch to rewrite assign_constant_locations which I
> think should make it automatically handle 8 and 16-bit values as
> well.  I'd rather do that than more special casing if everything works
> out ok.

I'm testing this patch with 16-bit values and will do whatever is needed
to get 16-bit working.

>
>  3) I sent out a series of patches to enable pushing of UBOs in
> Vulkan.  If we're not careful, these will clash with 16bit storage as
> UBO support suddenly has to imply push constant support.  That said,
> I'm willing to wait at least a little while before landing them to let
> us get 16bit push constant support sorted out.  The UBO pushing
> patches give us a nice little performance boost but we're nowhere near
> a release and I don't want it blocking you.

That will be my next priority, so that only the 16-bit input/output
support remains to be landed to finish this extension.

Chema

> On Wed, Nov 29, 2017 at 6:07 PM, Jose Maria Casanova Crespo
> > wrote:
>
> Hello,
>
> this is the V4 series for the implementation of the
> SPV_KHR_16bit_storage
> and VK_KHR_16bit_storage extensions on the anv vulkan driver, in
> addition
> to the GLSL and NIR support needed.
>
> The original series can be found here [1], the following v2 [2]
> and v3 [3].
>
> In short V4 includes the following:
>
>  * Reorder the series to enable features as they are implemented,
> the series
>    now enables first UBO and SSBO support, and then inputs/outputs and
>    finally push constants.
>  * Support the byte scattered read/write messages with different
> bit sizes
>    byte/word/dword.
>  * Refactor of the store_ssbo code and also fix stores when
> writemask was .yz
>  * Uses the sampler for load_ubo avoiding the initial
> implementation of
>    the series using byte_scattered_read.
>  * Addressed all the feedback provided by Jason and Topi on v3 review.
>
> This series is also available at:
>
> https://github.com/Igalia/mesa/tree/wip/VK_KHR_16bit_storage-rc4
> 
>
> The objective is to start landing part of this series, all
> feedback has been
> addressed for SSBO and UBO. But for input/outputs features it will
> probably
> need another iteration as was not completely reviewed. It is also
> needed
> to define the approach for push constants issues before of after
> landing
> the support with this implementation.
>
> Patches 1-5 and 8-17 have already been reviewed. Patch 7 was already
> reviewed but as it has changed too much i would appreciate another
> review. When patches until 25 or 28 are reviewed we could land
> UBOs and
> SSBOs support.
>
> Finally an updated overview of the patches:
>
> Patches 1-2 add 16-bit float, int and uint types to GLSL. This is
> needed because NIR uses GLSL types internally. We use the enums
> already defined at AMD_gpu_shader_half_float and NV_gpu_shader
> extensions. Patch 2 updates mesa/st, in order to avoid warnings for
> types not handled on a switch.
>
> Patches 3-6 add NIR support for those new GLSL 16-bit types,
> conversion opcodes, and rounding modes for float to half-float
> conversions.
>
> Patches 7-9 add the SPIR-V (SPV_KHR_16bit_storage) to NIR support.
>
> Patches 10-12 add general 16-bit support for i965. This includes
> handling of new types on several general purpose methods,
> update/remove some asserts.
>
> Patches 14-17 add support for 32 to 16-bit conversions for i965,
> including rounding mode opcodes (needed for float to half-float
> conversions), and an optimization that removes superfluous rounding
> mode sets.
>
> Patches 18-21 add and use two new messages: byte scattered read and
> write. Those were needed because untyped surface message has a fixed
> 32-bit write size. Those messages are used on the 16-bit support of
> store SSBO, load SSBO and load shared.
>
> Patch 22 adds helpers to allow un/shuffle 

Re: [Mesa-dev] [PATCH v4 23/44] i965/fs: Enables 16-bit load_ubo with sampler

2017-12-05 Thread Chema Casanova
El 30/11/17 a las 23:58, Jason Ekstrand escribió:
> On Wed, Nov 29, 2017 at 6:50 PM, Jose Maria Casanova Crespo
> > wrote:
>
> load_ubo is using 32-bit loads as uniforms surfaces have a 32-bit
> surface format defined. So when reading 16-bit components with the
> sampler we need to unshuffle two 16-bit components from each 32-bit
> component.
>
> Using the sampler avoids the use of the byte_scattered_read message
> that needs one message for each component and is supposed to be
> slower.
>
> In the case of SKL+ we take advantage of a hardware feature that
> automatically defines a channel mask based on the rlen value, so on
> SKL+ we only use half of the registers without using a header in the
> payload.
> ---
>  src/intel/compiler/brw_fs.cpp           | 31
> +++
>  src/intel/compiler/brw_fs_generator.cpp | 10 --
>  2 files changed, 35 insertions(+), 6 deletions(-)
>
> diff --git a/src/intel/compiler/brw_fs.cpp
> b/src/intel/compiler/brw_fs.cpp
> index 1ca4d416b2..9c543496ba 100644
> --- a/src/intel/compiler/brw_fs.cpp
> +++ b/src/intel/compiler/brw_fs.cpp
> @@ -184,9 +184,17 @@ fs_visitor::VARYING_PULL_CONSTANT_LOAD(const
> fs_builder ,
>      * a double this means we are only loading 2 elements worth of
> data.
>      * We also want to use a 32-bit data type for the dst of the
> load operation
>      * so other parts of the driver don't get confused about the
> size of the
> -    * result.
> +    * result. On the case of 16-bit data we only need half of the
> 32-bit
> +    * components on SKL+ as we take advance of using message
> return size to
> +    * define an xy channel mask.
>      */
> -   fs_reg vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 4);
> +   fs_reg vec4_result;
> +   if (type_sz(dst.type) == 2 && (devinfo->gen >= 9)) {
> +      vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 2);
> +      vec4_result = retype(vec4_result, BRW_REGISTER_TYPE_HF);
> +   } else {
> +      vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 4);
> +   }
>
>     fs_inst *inst =
> bld.emit(FS_OPCODE_VARYING_PULL_CONSTANT_LOAD_LOGICAL,
>                              vec4_result, surf_index, vec4_offset);
>     inst->size_written = 4 *
> vec4_result.component_size(inst->exec_size);
> @@ -197,8 +205,23 @@ fs_visitor::VARYING_PULL_CONSTANT_LOAD(const
> fs_builder ,
>     }
>
>     vec4_result.type = dst.type;
> -   bld.MOV(dst, offset(vec4_result, bld,
> -                       (const_offset & 0xf) /
> type_sz(vec4_result.type)));
> +
> +   if (type_sz(dst.type) == 2) {
> +      /* 16-bit types need to be unshuffled as each pair of
> 16-bit components
> +       * is packed on a 32-bit compoment because we are using a
> 32-bit format
> +       * in the surface of uniform that is read by the sampler.
> +       * TODO: On BDW+ mark when an uniform has 16-bit type so we
> could setup a
> +       * surface format of 16-bit and use the 16-bit return
> format at the
> +       * sampler.
> +       */
> +      vec4_result.stride = 2;
> +      bld.MOV(dst, byte_offset(offset(vec4_result, bld,
> +                                      (const_offset & 0x7) / 4),
> +                               (const_offset & 0x7) / 2 % 2 * 2));
> +   } else {
> +      bld.MOV(dst, offset(vec4_result, bld,
> +                          (const_offset & 0xf) /
> type_sz(vec4_result.type)));
> +   }
>
>
> This seems overly complicated.  How about something like

> fs_reg dw = offset(vec4_result, bld, (const_offset & 0xf) / 4);
> switch (type_sz(dst.type)) {
> case 2:
>    shuffle_32bit_load_result_to_16bit_data(bld, dst, dw, 1);
>    bld.MOV(dst, subscript(dw, dst.type, (const_offset / 2) & 1));
>    break;
> case 4:
>    bld.MOV(dst, dw);
>    break;
> case 8:
>    shuffle_32bit_load_result_to_64bit_data(bld, dst, dw, 1);
>    break;
> default:
>    unreachable();
> }

This implementation is much clearer. Tested, and it works perfectly
fine.
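As a sanity check of the index arithmetic in the suggestion above, here is a small standalone model of the 16-bit case. It is only a sketch: the function names are hypothetical, the vec4 is modelled as four plain dwords for a single channel, const_offset is assumed to be 2-byte aligned, and the lower-addressed half is assumed to live in the low 16 bits of its dword.

```cpp
#include <cassert>
#include <cstdint>

// Which dword of the 16-byte vec4 block holds the value:
// (const_offset & 0xf) / 4, as in the suggested code.
static unsigned
dword_index(unsigned const_offset)
{
   return (const_offset & 0xf) / 4;
}

// Which 16-bit half of that dword: (const_offset / 2) & 1,
// where 0 is the low half and 1 is the high half.
static unsigned
half_index(unsigned const_offset)
{
   return (const_offset / 2) & 1;
}

// Hypothetical single-channel model of the 16-bit path of the switch.
static uint16_t
load_16bit_from_vec4(const uint32_t vec4[4], unsigned const_offset)
{
   uint32_t dw = vec4[dword_index(const_offset)];
   return static_cast<uint16_t>(dw >> (16 * half_index(const_offset)));
}
```

For example, a byte offset of 6 selects dword 1 and its high half, and an offset of 22 wraps around to the same position inside the next 16-byte block.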

>  
>
>  }
>
>  /**
> diff --git a/src/intel/compiler/brw_fs_generator.cpp
> b/src/intel/compiler/brw_fs_generator.cpp
> index a3861cd68e..00a4e29147 100644
> --- a/src/intel/compiler/brw_fs_generator.cpp
> +++ b/src/intel/compiler/brw_fs_generator.cpp
> @@ -1381,12 +1381,18 @@
> fs_generator::generate_varying_pull_constant_load_gen7(fs_inst *inst,
>     uint32_t simd_mode, rlen, mlen;
>     if (inst->exec_size == 16) {
>        mlen = 2;
> -      rlen = 8;
> +      if (type_sz(dst.type) == 2 && (devinfo->gen >= 9))
> +         rlen = 4;
> +      else
> +         rlen = 8;
>
>
> I'm not sure what I think of this.  We intentionally use a vec4 today
> instead 

Re: [Mesa-dev] [PATCH v4 20/44] i965/fs: Add byte scattered read message and fs support

2017-12-04 Thread Chema Casanova
On 30/11/17 21:45, Jason Ekstrand wrote:
> On Wed, Nov 29, 2017 at 6:50 PM, Jose Maria Casanova Crespo
> > wrote:
> 
> v2: Fix alignment style (Topi Pohjolainen)
>     (Jason Ekstrand)
>     - Enable bit_size parameter to scattered messages to enable
> different
>       bitsizes byte/word/dword.
>     - Remove use of brw_send_indirect_scattered_message in favor of
>       brw_send_indirect_surface_message.
>     - Move scattered messages to surface messages namespace.
>     - Assert align1 for scattered messages and assume Gen8+.
>     - Inline brw_set_dp_byte_scattered_read.
> ---
>  src/intel/compiler/brw_eu.h                    |  8 +++
>  src/intel/compiler/brw_eu_defines.h            |  2 ++
>  src/intel/compiler/brw_eu_emit.c               | 30
> ++
>  src/intel/compiler/brw_fs.cpp                  | 19 
>  src/intel/compiler/brw_fs_copy_propagation.cpp |  2 ++
>  src/intel/compiler/brw_fs_generator.cpp        |  6 ++
>  src/intel/compiler/brw_fs_surface_builder.cpp  | 11 +-
>  src/intel/compiler/brw_fs_surface_builder.h    |  7 ++
>  src/intel/compiler/brw_shader.cpp              |  6 ++
>  9 files changed, 90 insertions(+), 1 deletion(-)
> 
> diff --git a/src/intel/compiler/brw_eu.h b/src/intel/compiler/brw_eu.h
> index 3ac3b4342a..2d0f56f793 100644
> --- a/src/intel/compiler/brw_eu.h
> +++ b/src/intel/compiler/brw_eu.h
> @@ -485,6 +485,14 @@ brw_typed_surface_write(struct brw_codegen *p,
>                          unsigned msg_length,
>                          unsigned num_channels);
> 
> +void
> +brw_byte_scattered_read(struct brw_codegen *p,
> +                        struct brw_reg dst,
> +                        struct brw_reg payload,
> +                        struct brw_reg surface,
> +                        unsigned msg_length,
> +                        unsigned bit_size);
> +
>  void
>  brw_byte_scattered_write(struct brw_codegen *p,
>                           struct brw_reg payload,
> diff --git a/src/intel/compiler/brw_eu_defines.h
> b/src/intel/compiler/brw_eu_defines.h
> index de6330ee54..aa510ebfa4 100644
> --- a/src/intel/compiler/brw_eu_defines.h
> +++ b/src/intel/compiler/brw_eu_defines.h
> @@ -409,6 +409,8 @@ enum opcode {
>      * opcode, but instead of taking a single payload blog they
> expect their
>      * arguments separately as individual sources, like untyped
> write/read.
>      */
> +   SHADER_OPCODE_BYTE_SCATTERED_READ,
> +   SHADER_OPCODE_BYTE_SCATTERED_READ_LOGICAL,
>     SHADER_OPCODE_BYTE_SCATTERED_WRITE,
>     SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL,
> 
> diff --git a/src/intel/compiler/brw_eu_emit.c
> b/src/intel/compiler/brw_eu_emit.c
> index ded7e228cf..bdc516848a 100644
> --- a/src/intel/compiler/brw_eu_emit.c
> +++ b/src/intel/compiler/brw_eu_emit.c
> @@ -2998,6 +2998,36 @@ static enum brw_data_size
> brw_data_size_from_bit_size(unsigned bit_size)
>     }
>  }
> 
> +
> +void
> +brw_byte_scattered_read(struct brw_codegen *p,
> +                        struct brw_reg dst,
> +                        struct brw_reg payload,
> +                        struct brw_reg surface,
> +                        unsigned msg_length,
> +                        unsigned bit_size)
> +{
> +   assert(brw_inst_access_mode(p->devinfo, p->current) == BRW_ALIGN_1);
> +   const struct gen_device_info *devinfo = p->devinfo;
> +   const unsigned sfid =  GEN7_SFID_DATAPORT_DATA_CACHE;
> +
> +   struct brw_inst *insn = brw_send_indirect_surface_message(
> +      p, sfid, dst, payload, surface, msg_length,
> +      brw_surface_payload_size(p, 1, true, true),
> +      false);
> +
> +   unsigned msg_control = brw_data_size_from_bit_size(bit_size) << 2;
> +
> +   if (brw_inst_exec_size(devinfo, p->current) == BRW_EXECUTE_16)
> +      msg_control |= 1; /* SIMD16 mode */
> +   else
> +      msg_control |= 0; /* SIMD8 mode */
> +
> +   brw_inst_set_dp_msg_type(devinfo, insn,
> +                            HSW_DATAPORT_DC_PORT0_BYTE_SCATTERED_READ);
> +   brw_inst_set_dp_msg_control(devinfo, insn, msg_control);
> +}
> +
>  void
>  brw_byte_scattered_write(struct brw_codegen *p,
>                           struct brw_reg payload,
> diff --git a/src/intel/compiler/brw_fs.cpp
> b/src/intel/compiler/brw_fs.cpp
> index 32f1d757f0..1ca4d416b2 100644
> --- a/src/intel/compiler/brw_fs.cpp
> +++ b/src/intel/compiler/brw_fs.cpp
> @@ -251,6 +251,7 @@ fs_inst::is_send_from_grf() const
>     case SHADER_OPCODE_UNTYPED_SURFACE_READ:
>     case 

Re: [Mesa-dev] [PATCH v4 15/44] i965/fs: Define new shader opcode to set rounding modes

2017-12-04 Thread Chema Casanova
On 01/12/17 at 09:06, Pohjolainen, Topi wrote:
> On Thu, Nov 30, 2017 at 03:07:59AM +0100, Jose Maria Casanova Crespo wrote:
>> From: Alejandro Piñeiro 
>>
>> Although it is possible to emit them directly as AND/OR on brw_fs_nir,
>> having a specific opcode makes it easier to remove duplicate settings
>> later.
>>
>> v2: (Curro)
>>   - Set thread control to 'switch' when using the control register
>>   - Use a single SHADER_OPCODE_RND_MODE opcode taking an immediate
>> with the rounding mode.
>>   - Avoid magic numbers setting rounding mode field at control register.
>> v3: (Curro)
>>   - Remove redundant and add missing whitespace lines.
>>   - Match printing instruction to IR opcode "rnd_mode"
>>
>> v4: (Topi Pohjolainen)
>>   - Fix code style.
>>
>> Signed-off-by:  Alejandro Piñeiro 
>> Signed-off-by:  Jose Maria Casanova Crespo 
>> Reviewed-by: Francisco Jerez 
>> Reviewed-by: Jason Ekstrand 
>> ---
>>  src/intel/compiler/brw_eu.h |  4 
>>  src/intel/compiler/brw_eu_defines.h | 16 
>>  src/intel/compiler/brw_eu_emit.c| 33 
>> +
>>  src/intel/compiler/brw_fs_generator.cpp |  5 +
>>  src/intel/compiler/brw_shader.cpp   |  4 
>>  5 files changed, 62 insertions(+)
>>
>> diff --git a/src/intel/compiler/brw_eu.h b/src/intel/compiler/brw_eu.h
>> index b5a206b3f1..343dcd867d 100644
>> --- a/src/intel/compiler/brw_eu.h
>> +++ b/src/intel/compiler/brw_eu.h
>> @@ -510,6 +510,10 @@ brw_broadcast(struct brw_codegen *p,
>>struct brw_reg src,
>>struct brw_reg idx);
>>  
>> +void
>> +brw_rounding_mode(struct brw_codegen *p,
>> +  enum brw_rnd_mode mode);
>> +
>>  /***
>>   * brw_eu_util.c:
>>   */
>> diff --git a/src/intel/compiler/brw_eu_defines.h 
>> b/src/intel/compiler/brw_eu_defines.h
>> index 291dd361a2..8a8f36cbc1 100644
>> --- a/src/intel/compiler/brw_eu_defines.h
>> +++ b/src/intel/compiler/brw_eu_defines.h
>> @@ -400,6 +400,8 @@ enum opcode {
>> SHADER_OPCODE_TYPED_SURFACE_WRITE,
>> SHADER_OPCODE_TYPED_SURFACE_WRITE_LOGICAL,
>>  
>> +   SHADER_OPCODE_RND_MODE,
>> +
>> SHADER_OPCODE_MEMORY_FENCE,
>>  
>> SHADER_OPCODE_GEN4_SCRATCH_READ,
>> @@ -1238,4 +1240,18 @@ enum brw_message_target {
>>  /* R0 */
>>  # define GEN7_GS_PAYLOAD_INSTANCE_ID_SHIFT  27
>>  
>> +/* CR0.0[5:4] Floating-Point Rounding Modes
>> + *  Skylake PRM, Volume 7 Part 1, "Control Register", page 756
>> + */
>> +
>> +#define BRW_CR0_RND_MODE_MASK 0x30
>> +#define BRW_CR0_RND_MODE_SHIFT4
>> +
>> +enum PACKED brw_rnd_mode {
>> +   BRW_RND_MODE_RTNE = 0,  /* Round to Nearest or Even */
>> +   BRW_RND_MODE_RU = 1,/* Round Up, toward +inf */
>> +   BRW_RND_MODE_RD = 2,/* Round Down, toward -inf */
>> +   BRW_RND_MODE_RTZ = 3,   /* Round Toward Zero */
>> +};
>> +
>>  #endif /* BRW_EU_DEFINES_H */
>> diff --git a/src/intel/compiler/brw_eu_emit.c 
>> b/src/intel/compiler/brw_eu_emit.c
>> index dc14023b48..ca97ff7325 100644
>> --- a/src/intel/compiler/brw_eu_emit.c
>> +++ b/src/intel/compiler/brw_eu_emit.c
>> @@ -3589,3 +3589,36 @@ brw_WAIT(struct brw_codegen *p)
>> brw_inst_set_exec_size(devinfo, insn, BRW_EXECUTE_1);
>> brw_inst_set_mask_control(devinfo, insn, BRW_MASK_DISABLE);
>>  }
>> +
>> +/**
>> + * Changes the floating point rounding mode updating the control register
>> + * field defined at cr0.0[5-6] bits. This function supports the changes to
>> + * RTNE (00), RU (01), RD (10) and RTZ (11) rounding using bitwise 
>> operations.
>> + * Only RTNE and RTZ rounding are enabled at nir.
>> + */
>> +void
>> +brw_rounding_mode(struct brw_codegen *p,
>> +  enum brw_rnd_mode mode)
>> +{
>> +   const unsigned bits = mode << BRW_CR0_RND_MODE_SHIFT;
>> +
>> +   if (bits != BRW_CR0_RND_MODE_MASK) {
>> +  brw_inst *inst = brw_AND(p, brw_cr0_reg(0), brw_cr0_reg(0),
>> +   brw_imm_ud(~BRW_CR0_RND_MODE_MASK));
>> +
>> +  /* From the Skylake PRM, Volume 7, page 760:
>> +   *  "Implementation Restriction on Register Access: When the control
>> +   *   register is used as an explicit source and/or destination, 
>> hardware
>> +   *   does not ensure execution pipeline coherency. Software must set 
>> the
>> +   *   thread control field to ‘switch’ for an instruction that uses
> Putting "uses" to the next line would avoid overflowing the 80 column line
> width.

My editor says that "uses" is at column 72, and that "hardware" and "the" on
the previous lines are within the limit, at column 78...

Chema
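
As a side note on the patch logic quoted above, the following is a small
CPU-side model of the read-modify-write on the rounding mode field. It is only
a sketch: the function and macro names are hypothetical, the real patch emits
AND/OR instructions on cr0.0 rather than operating on an integer, and the
operand of the final OR is an assumption because the quoted patch is cut off
at that point. The mask 0x30 with shift 4 covers bits 4 and 5 of the register.

```c
#include <assert.h>
#include <stdint.h>

/* Same values as the defines in the quoted patch; names are local to
 * this sketch.
 */
#define CR0_RND_MODE_MASK  0x30u
#define CR0_RND_MODE_SHIFT 4

enum rnd_mode {
   RND_MODE_RTNE = 0, /* Round to nearest or even */
   RND_MODE_RU   = 1, /* Round up, toward +inf */
   RND_MODE_RD   = 2, /* Round down, toward -inf */
   RND_MODE_RTZ  = 3, /* Round toward zero */
};

/* Hypothetical CPU model of the update of the rounding mode field. */
static uint32_t
set_rounding_mode(uint32_t cr0, enum rnd_mode mode)
{
   const uint32_t bits = (uint32_t)mode << CR0_RND_MODE_SHIFT;
   assert((bits & ~CR0_RND_MODE_MASK) == 0);

   /* Clearing is only needed when at least one bit of the field must end
    * up as zero. For RTZ (both bits set) the OR below is enough.
    */
   if (bits != CR0_RND_MODE_MASK)
      cr0 &= ~CR0_RND_MODE_MASK;

   /* Setting is only needed when at least one bit must end up as one.
    * For RTNE (both bits clear) the AND above is enough.
    */
   if (bits)
      cr0 |= bits;

   return cr0;
}
```

With this structure RTNE and RTZ each need a single operation, while RU and RD
need both the AND and the OR.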


>
>> +   *   control register as an explicit operand."
>> +   */
>> +  brw_inst_set_thread_control(p->devinfo, inst, BRW_THREAD_SWITCH);
>> +}
>> +
>> +   if (bits) {
>> +  brw_inst *inst = brw_OR(p, 

Re: [Mesa-dev] [PATCH v4 11/44] i965: Support for 16-bit base types in helper functions

2017-12-04 Thread Chema Casanova
On 01/12/17 at 09:03, Pohjolainen, Topi wrote:
> On Thu, Nov 30, 2017 at 03:07:55AM +0100, Jose Maria Casanova Crespo wrote:
>> v2: Fixed calculation of scalar size for 16-bit types. (Jason Ekstrand)
>>
>> Signed-off-by: Jose Maria Casanova Crespo 
>> Signed-off-by: Eduardo Lima 
>> Reviewed-by: Jason Ekstrand 
>> ---
>>  src/intel/compiler/brw_fs.cpp |  4 
>>  src/intel/compiler/brw_nir.c  | 16 
>>  src/intel/compiler/brw_shader.cpp |  6 ++
>>  3 files changed, 26 insertions(+)
>>
>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
>> index 6772c0d5a5..6cdd2bd9f3 100644
>> --- a/src/intel/compiler/brw_fs.cpp
>> +++ b/src/intel/compiler/brw_fs.cpp
>> @@ -454,6 +454,10 @@ type_size_scalar(const struct glsl_type *type)
>> case GLSL_TYPE_FLOAT:
>> case GLSL_TYPE_BOOL:
>>return type->components();
>> +   case GLSL_TYPE_UINT16:
>> +   case GLSL_TYPE_INT16:
>> +   case GLSL_TYPE_FLOAT16:
>> +  return DIV_ROUND_UP(type->components(), 2);
>> case GLSL_TYPE_DOUBLE:
>> case GLSL_TYPE_UINT64:
>> case GLSL_TYPE_INT64:
>> diff --git a/src/intel/compiler/brw_nir.c b/src/intel/compiler/brw_nir.c
>> index 8f3f77f89a..cca4b45ae6 100644
>> --- a/src/intel/compiler/brw_nir.c
>> +++ b/src/intel/compiler/brw_nir.c
>> @@ -843,12 +843,18 @@ brw_type_for_nir_type(const struct gen_device_info 
>> *devinfo, nir_alu_type type)
>> case nir_type_float:
>> case nir_type_float32:
>>return BRW_REGISTER_TYPE_F;
>> +   case nir_type_float16:
>> +  return BRW_REGISTER_TYPE_HF;
>> case nir_type_float64:
>>return BRW_REGISTER_TYPE_DF;
>> case nir_type_int64:
>>return devinfo->gen < 8 ? BRW_REGISTER_TYPE_DF : BRW_REGISTER_TYPE_Q;
>> case nir_type_uint64:
>>return devinfo->gen < 8 ? BRW_REGISTER_TYPE_DF : BRW_REGISTER_TYPE_UQ;
>> +   case nir_type_int16:
>> +  return BRW_REGISTER_TYPE_W;
>> +   case nir_type_uint16:
>> +  return BRW_REGISTER_TYPE_UW;
>> default:
>>unreachable("unknown type");
>> }
>> @@ -867,6 +873,9 @@ brw_glsl_base_type_for_nir_type(nir_alu_type type)
>> case nir_type_float32:
>>return GLSL_TYPE_FLOAT;
>>  
>> +   case nir_type_float16:
>> +  return GLSL_TYPE_FLOAT16;
>> +
>> case nir_type_float64:
>>return GLSL_TYPE_DOUBLE;
>>  
>> @@ -878,6 +887,13 @@ brw_glsl_base_type_for_nir_type(nir_alu_type type)
>> case nir_type_uint32:
>>return GLSL_TYPE_UINT;
>>  
>> +   case nir_type_int16:
>> +  return GLSL_TYPE_INT16;
>> +
>> +   case nir_type_uint16:
>> +  return GLSL_TYPE_UINT16;
>> +
>> +
> Extra newline.

Fixed locally,

Thanks for the review.

Chema
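
For context on the type_size_scalar() change in this patch, here is a minimal
sketch of the slot counting it implies: two 16-bit components share one 32-bit
slot, rounding up. The function name is hypothetical, and the 64-bit case
(two slots per component) is an assumption, since the quoted hunk does not
show the body of that case.

```c
#include <assert.h>

#define DIV_ROUND_UP(a, b) (((a) + (b) - 1) / (b))

/* Hypothetical helper: number of 32-bit slots used by a vector with the
 * given number of components and component bit size.
 */
static unsigned
scalar_slots(unsigned components, unsigned bit_size)
{
   assert(bit_size == 16 || bit_size == 32 || bit_size == 64);

   switch (bit_size) {
   case 16:
      /* Two 16-bit components are packed in each 32-bit slot. */
      return DIV_ROUND_UP(components, 2);
   case 64:
      /* Assumption: each 64-bit component takes two 32-bit slots. */
      return components * 2;
   default:
      return components;
   }
}
```

So a 16-bit vec3 takes two slots, the same as a 16-bit vec4, while a 16-bit
scalar still takes a full slot.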

>
>> default:
>>unreachable("bad type");
>> }
>> diff --git a/src/intel/compiler/brw_shader.cpp 
>> b/src/intel/compiler/brw_shader.cpp
>> index ba61481a0a..aa9e5f3d28 100644
>> --- a/src/intel/compiler/brw_shader.cpp
>> +++ b/src/intel/compiler/brw_shader.cpp
>> @@ -34,14 +34,20 @@ enum brw_reg_type
>>  brw_type_for_base_type(const struct glsl_type *type)
>>  {
>> switch (type->base_type) {
>> +   case GLSL_TYPE_FLOAT16:
>> +  return BRW_REGISTER_TYPE_HF;
>> case GLSL_TYPE_FLOAT:
>>return BRW_REGISTER_TYPE_F;
>> case GLSL_TYPE_INT:
>> case GLSL_TYPE_BOOL:
>> case GLSL_TYPE_SUBROUTINE:
>>return BRW_REGISTER_TYPE_D;
>> +   case GLSL_TYPE_INT16:
>> +  return BRW_REGISTER_TYPE_W;
>> case GLSL_TYPE_UINT:
>>return BRW_REGISTER_TYPE_UD;
>> +   case GLSL_TYPE_UINT16:
>> +  return BRW_REGISTER_TYPE_UW;
>> case GLSL_TYPE_ARRAY:
>>return brw_type_for_base_type(type->fields.array);
>> case GLSL_TYPE_STRUCT:
>> -- 
>> 2.14.3
>>


Re: [Mesa-dev] [PATCH v4 01/44] glsl: Add 16-bit types

2017-12-04 Thread Chema Casanova

On 30/11/17 at 10:25, Pohjolainen, Topi wrote:
> On Thu, Nov 30, 2017 at 03:07:45AM +0100, Jose Maria Casanova Crespo wrote:
>> From: Eduardo Lima Mitev 
> Just a few style nits, see below.
>
>> Adds new INT16, UINT16 and FLOAT16 base types.
>>
>> The corresponding GL types for half floats were reused from the
>> AMD_gpu_shader_half_float extension. The int16 and uint16 types come from
>> NV_gpu_shader_5 extension.
>>
>> This adds the builtins and the lexer support.
>>
>> To avoid a bunch of warnings due to cases not handled in switch, the
>> new types have been added to a few places using same behavior as
>> their 32-bit counterparts, except for a few trivial cases where they are
>> already handled properly. Subsequent patches in this set will provide
>> correct 16-bit implementations when needed.
>>
>> v2: * Use FLOAT16 instead of HALF_FLOAT as name of the base type.
>> * Removed float16_t from builtin types.
>> * Don't copy 16-bit types as if they were 32-bit values in
>>   copy_constant_to_storage().
>> * Use get_scalar_type() instead of adding a new custom switch
>>   statement.
>> (Jason Ekstrand)
>> v3: Use GL_FLOAT16_NV instead of GL_HALF_FLOAT for consistency
>> (Ilia Mirkin)
>> v4: Add missing 16-bit base types support in glsl_to_nir (Eduardo Lima).
>>
>> Signed-off-by: Jose Maria Casanova Crespo 
>> Signed-off-by: Eduardo Lima 
>> Signed-off-by: Alejandro Piñeiro 
>> Reviewed-by: Jason Ekstrand 
>> Reviewed-by: Nicolai Hähnle 
>> ---
>>  src/compiler/builtin_type_macros.h  | 26 +++
>>  src/compiler/glsl/ast_to_hir.cpp|  3 +
>>  src/compiler/glsl/glsl_to_nir.cpp   |  6 +-
>>  src/compiler/glsl/ir_clone.cpp  |  3 +
>>  src/compiler/glsl/link_uniform_initializers.cpp |  3 +
>>  src/compiler/glsl/lower_buffer_access.cpp   |  3 +-
>>  src/compiler/glsl_types.cpp | 93 
>> -
>>  src/compiler/glsl_types.h   | 10 ++-
>>  src/mesa/program/ir_to_mesa.cpp |  6 ++
>>  9 files changed, 145 insertions(+), 8 deletions(-)
>>
>> diff --git a/src/compiler/builtin_type_macros.h 
>> b/src/compiler/builtin_type_macros.h
>> index a275617b34..e3a1cd29c8 100644
>> --- a/src/compiler/builtin_type_macros.h
>> +++ b/src/compiler/builtin_type_macros.h
>> @@ -62,6 +62,22 @@ DECL_TYPE(mat3x4, GL_FLOAT_MAT3x4, GLSL_TYPE_FLOAT, 4, 3)
>>  DECL_TYPE(mat4x2, GL_FLOAT_MAT4x2, GLSL_TYPE_FLOAT, 2, 4)
>>  DECL_TYPE(mat4x3, GL_FLOAT_MAT4x3, GLSL_TYPE_FLOAT, 3, 4)
>>  
>> +DECL_TYPE(float16_t, GL_FLOAT16_NV,GLSL_TYPE_FLOAT16, 1, 1)
>> +DECL_TYPE(f16vec2,   GL_FLOAT16_VEC2_NV,   GLSL_TYPE_FLOAT16, 2, 1)
>> +DECL_TYPE(f16vec3,   GL_FLOAT16_VEC3_NV,   GLSL_TYPE_FLOAT16, 3, 1)
>> +DECL_TYPE(f16vec4,   GL_FLOAT16_VEC4_NV,   GLSL_TYPE_FLOAT16, 4, 1)
>> +
>> +DECL_TYPE(f16mat2,   GL_FLOAT16_MAT2_AMD,   GLSL_TYPE_FLOAT16, 2, 2)
>> +DECL_TYPE(f16mat3,   GL_FLOAT16_MAT3_AMD,   GLSL_TYPE_FLOAT16, 3, 3)
>> +DECL_TYPE(f16mat4,   GL_FLOAT16_MAT4_AMD,   GLSL_TYPE_FLOAT16, 4, 4)
>> +
>> +DECL_TYPE(f16mat2x3, GL_FLOAT16_MAT2x3_AMD, GLSL_TYPE_FLOAT16, 3, 2)
>> +DECL_TYPE(f16mat2x4, GL_FLOAT16_MAT2x4_AMD, GLSL_TYPE_FLOAT16, 4, 2)
>> +DECL_TYPE(f16mat3x2, GL_FLOAT16_MAT3x2_AMD, GLSL_TYPE_FLOAT16, 2, 3)
>> +DECL_TYPE(f16mat3x4, GL_FLOAT16_MAT3x4_AMD, GLSL_TYPE_FLOAT16, 4, 3)
>> +DECL_TYPE(f16mat4x2, GL_FLOAT16_MAT4x2_AMD, GLSL_TYPE_FLOAT16, 2, 4)
>> +DECL_TYPE(f16mat4x3, GL_FLOAT16_MAT4x3_AMD, GLSL_TYPE_FLOAT16, 3, 4)
>> +
>>  DECL_TYPE(double,  GL_DOUBLE,GLSL_TYPE_DOUBLE, 1, 1)
>>  DECL_TYPE(dvec2,   GL_DOUBLE_VEC2,   GLSL_TYPE_DOUBLE, 2, 1)
>>  DECL_TYPE(dvec3,   GL_DOUBLE_VEC3,   GLSL_TYPE_DOUBLE, 3, 1)
>> @@ -88,6 +104,16 @@ DECL_TYPE(u64vec2,  GL_UNSIGNED_INT64_VEC2_ARB, 
>> GLSL_TYPE_UINT64, 2, 1)
>>  DECL_TYPE(u64vec3,  GL_UNSIGNED_INT64_VEC3_ARB, GLSL_TYPE_UINT64, 3, 1)
>>  DECL_TYPE(u64vec4,  GL_UNSIGNED_INT64_VEC4_ARB, GLSL_TYPE_UINT64, 4, 1)
>>  
>> +DECL_TYPE(int16_t,  GL_INT16_NV,  GLSL_TYPE_INT16, 1, 1)
>> +DECL_TYPE(i16vec2,  GL_INT16_VEC2_NV, GLSL_TYPE_INT16, 2, 1)
>> +DECL_TYPE(i16vec3,  GL_INT16_VEC3_NV, GLSL_TYPE_INT16, 3, 1)
>> +DECL_TYPE(i16vec4,  GL_INT16_VEC4_NV, GLSL_TYPE_INT16, 4, 1)
>> +
>> +DECL_TYPE(uint16_t, GL_UNSIGNED_INT16_NV,  GLSL_TYPE_UINT16, 1, 1)
>> +DECL_TYPE(u16vec2,  GL_UNSIGNED_INT16_VEC2_NV, GLSL_TYPE_UINT16, 2, 1)
>> +DECL_TYPE(u16vec3,  GL_UNSIGNED_INT16_VEC3_NV, GLSL_TYPE_UINT16, 3, 1)
>> +DECL_TYPE(u16vec4,  GL_UNSIGNED_INT16_VEC4_NV, GLSL_TYPE_UINT16, 4, 1)
>> +
>>  DECL_TYPE(sampler,   GL_SAMPLER_1D,   
>> GLSL_TYPE_SAMPLER, GLSL_SAMPLER_DIM_1D,   0, 0, GLSL_TYPE_VOID)
>>  DECL_TYPE(sampler1D, GL_SAMPLER_1D,   
>> GLSL_TYPE_SAMPLER, GLSL_SAMPLER_DIM_1D,   0, 0, GLSL_TYPE_FLOAT)
>>  DECL_TYPE(sampler2D, GL_SAMPLER_2D,  

Re: [Mesa-dev] [PATCH v4 18/44] i965/fs: Add byte scattered write message and fs support

2017-12-01 Thread Chema Casanova
On 30/11/17 21:42, Jason Ekstrand wrote:
> On Wed, Nov 29, 2017 at 6:08 PM, Jose Maria Casanova Crespo wrote:
> 
> v2: (Jason Ekstrand)
>     - Enable bit_size parameter to scattered messages to enable
> different
>       bitsizes byte/word/dword.
>     - Remove use of brw_send_indirect_scattered_message in favor of
>       brw_send_indirect_surface_message.
>     - Move scattered messages to surface messages namespace.
>     - Assert align1 for scattered messages and assume Gen8+.
>     - Inline brw_set_dp_byte_scattered_write.
> 
> Signed-off-by: Jose Maria Casanova Crespo
> Signed-off-by: Alejandro Piñeiro
> ---
>  src/intel/compiler/brw_eu.h                    |  7 +
>  src/intel/compiler/brw_eu_defines.h            | 17 +++
>  src/intel/compiler/brw_eu_emit.c               | 42
> ++
>  src/intel/compiler/brw_fs.cpp                  | 14 +
>  src/intel/compiler/brw_fs_copy_propagation.cpp |  2 ++
>  src/intel/compiler/brw_fs_generator.cpp        |  6 
>  src/intel/compiler/brw_fs_surface_builder.cpp  | 11 +++
>  src/intel/compiler/brw_fs_surface_builder.h    |  7 +
>  src/intel/compiler/brw_shader.cpp              |  7 +
>  9 files changed, 113 insertions(+)
> 
> diff --git a/src/intel/compiler/brw_eu.h b/src/intel/compiler/brw_eu.h
> index 343dcd867d..3ac3b4342a 100644
> --- a/src/intel/compiler/brw_eu.h
> +++ b/src/intel/compiler/brw_eu.h
> @@ -485,6 +485,13 @@ brw_typed_surface_write(struct brw_codegen *p,
>                          unsigned msg_length,
>                          unsigned num_channels);
> 
> +void
> +brw_byte_scattered_write(struct brw_codegen *p,
> +                         struct brw_reg payload,
> +                         struct brw_reg surface,
> +                         unsigned msg_length,
> +                         unsigned bit_size);
> +
>  void
>  brw_memory_fence(struct brw_codegen *p,
>                   struct brw_reg dst);
> diff --git a/src/intel/compiler/brw_eu_defines.h
> b/src/intel/compiler/brw_eu_defines.h
> index 9d5cf05c86..de6330ee54 100644
> --- a/src/intel/compiler/brw_eu_defines.h
> +++ b/src/intel/compiler/brw_eu_defines.h
> @@ -402,6 +402,16 @@ enum opcode {
> 
>     SHADER_OPCODE_RND_MODE,
> 
> +   /**
> +    * Byte scattered write/read opcodes.
> +    *
> +    * LOGICAL opcodes are eventually translated to the matching
> non-LOGICAL
> +    * opcode, but instead of taking a single payload blog they
> expect their
> +    * arguments separately as individual sources, like untyped
> write/read.
> +    */
> +   SHADER_OPCODE_BYTE_SCATTERED_WRITE,
> +   SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL,
> +
>     SHADER_OPCODE_MEMORY_FENCE,
> 
>     SHADER_OPCODE_GEN4_SCRATCH_READ,
> @@ -1255,4 +1265,11 @@ enum PACKED brw_rnd_mode {
>     BRW_RND_MODE_UNSPECIFIED,  /* Unspecified rounding mode */
>  };
> 
> +/* MDC_DS - Data Size Message Descriptor Control Field */
> +enum PACKED brw_data_size {
> 
> 
> I'm not sure how I feel about this being an enum with such a generic name.

Right, the PRM uses the more exact term "Data Elements", but this field is
only used by byte_scattered/scaled writes/reads. As I will follow your next
suggestion of using #define, I'm changing the names to:

#define GEN7_BYTE_SCATTERED_DATA_ELEMENT_BYTE  0
#define GEN7_BYTE_SCATTERED_DATA_ELEMENT_WORD  1
#define GEN7_BYTE_SCATTERED_DATA_ELEMENT_DWORD 2

I'll include this in the comment about MDC_DS:
"Specifies the number of Bytes to be read or written per Dword used at
byte_scattered read/write and byte_scaled read/write messages."
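
To make the use of these values concrete, here is a small sketch of how the
message control is built from them, following the pattern visible in the
quoted brw_byte_scattered_read() code (data size shifted left by 2, bit 0 set
for SIMD16). The macro and function names below are local to the sketch, and
the bit size to data element mapping is an assumption, because the body of
brw_data_size_from_bit_size() is not quoted in this thread.

```c
#include <assert.h>

/* Same encodings as the defines proposed above; names shortened for
 * this sketch.
 */
#define DATA_ELEMENT_BYTE  0
#define DATA_ELEMENT_WORD  1
#define DATA_ELEMENT_DWORD 2

/* Hypothetical mapping from a bit size to the data element encoding. */
static unsigned
data_element_from_bit_size(unsigned bit_size)
{
   switch (bit_size) {
   case 8:
      return DATA_ELEMENT_BYTE;
   case 16:
      return DATA_ELEMENT_WORD;
   case 32:
      return DATA_ELEMENT_DWORD;
   default:
      assert(!"Unsupported bit size");
      return DATA_ELEMENT_DWORD;
   }
}

/* Hypothetical helper: message control for a byte scattered message. */
static unsigned
byte_scattered_msg_control(unsigned bit_size, int simd16)
{
   unsigned msg_control = data_element_from_bit_size(bit_size) << 2;

   if (simd16)
      msg_control |= 1; /* SIMD16 mode, otherwise SIMD8 */

   return msg_control;
}
```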

>  
> 
> +   GEN7_BYTE_SCATTERED_DATA_SIZE_BYTE = 0,
> +   GEN7_BYTE_SCATTERED_DATA_SIZE_WORD = 1,
> +   GEN7_BYTE_SCATTERED_DATA_SIZE_DWORD = 2
> +};
> +
>  #endif /* BRW_EU_DEFINES_H */
> diff --git a/src/intel/compiler/brw_eu_emit.c
> b/src/intel/compiler/brw_eu_emit.c
> index ca97ff7325..ded7e228cf 100644
> --- a/src/intel/compiler/brw_eu_emit.c
> +++ b/src/intel/compiler/brw_eu_emit.c
> @@ -2580,6 +2580,7 @@ brw_send_indirect_surface_message(struct
> brw_codegen *p,
>     return insn;
>  }
> 
> +
>  static bool
>  while_jumps_before_offset(const struct gen_device_info *devinfo,
>                            brw_inst *insn, int while_offset, int
> start_offset)
> @@ -2983,6 +2984,47 @@ brw_untyped_surface_write(struct brw_codegen *p,
>        p, insn, num_channels);
>  }
> 
> +static enum brw_data_size brw_data_size_from_bit_size(unsigned
> bit_size)
> 
> 
> Please put the return type on 

Re: [Mesa-dev] [PATCH v4 07/44] spirv/nir: Handle 16-bit types

2017-12-01 Thread Chema Casanova
On 30/11/17 21:24, Jason Ekstrand wrote:
> I sprinkled a few mostly trivial comments below.  With those fixed,
> 
> Reviewed-by: Jason Ekstrand <ja...@jlekstrand.net
> <mailto:ja...@jlekstrand.net>>
> 
> On Wed, Nov 29, 2017 at 6:07 PM, Jose Maria Casanova Crespo
> <jmcasan...@igalia.com <mailto:jmcasan...@igalia.com>> wrote:
> 
> From: Eduardo Lima Mitev <el...@igalia.com <mailto:el...@igalia.com>>
> 
> v2: Added more missing implementations of 16-bit types. (Jason Ekstrand)
> 
> v3: Store values in values[0].u16[i] (Jason Ekstrand)
>     Include switches based on bitsize for 16-bit types
>     (Chema Casanova)
> 
> Signed-off-by: Jose Maria Casanova Crespo <jmcasan...@igalia.com
> <mailto:jmcasan...@igalia.com>>
> Signed-off-by: Eduardo Lima <el...@igalia.com <mailto:el...@igalia.com>>
> ---
>  src/compiler/spirv/spirv_to_nir.c  | 111
> +++--
>  src/compiler/spirv/vtn_variables.c |  21 +++
>  2 files changed, 115 insertions(+), 17 deletions(-)
> 
> diff --git a/src/compiler/spirv/spirv_to_nir.c
> b/src/compiler/spirv/spirv_to_nir.c
> index 027efab88d..f745373473 100644
> --- a/src/compiler/spirv/spirv_to_nir.c
> +++ b/src/compiler/spirv/spirv_to_nir.c
> @@ -104,10 +104,13 @@ vtn_const_ssa_value(struct vtn_builder *b,
> nir_constant *constant,
>     switch (glsl_get_base_type(type)) {
>     case GLSL_TYPE_INT:
>     case GLSL_TYPE_UINT:
> +   case GLSL_TYPE_INT16:
> +   case GLSL_TYPE_UINT16:
>     case GLSL_TYPE_INT64:
>     case GLSL_TYPE_UINT64:
>     case GLSL_TYPE_BOOL:
>     case GLSL_TYPE_FLOAT:
> +   case GLSL_TYPE_FLOAT16:
>     case GLSL_TYPE_DOUBLE: {
>        int bit_size = glsl_get_bit_size(type);
>        if (glsl_type_is_vector_or_scalar(type)) {
> @@ -751,16 +754,38 @@ vtn_handle_type(struct vtn_builder *b, SpvOp
> opcode,
>        int bit_size = w[2];
>        const bool signedness = w[3];
>        val->type->base_type = vtn_base_type_scalar;
> -      if (bit_size == 64)
> +      switch (bit_size) {
> +      case 64:
>           val->type->type = (signedness ? glsl_int64_t_type() :
> glsl_uint64_t_type());
> -      else
> +         break;
> +      case 32:
>           val->type->type = (signedness ? glsl_int_type() :
> glsl_uint_type());
> +         break;
> +      case 16:
> +         val->type->type = (signedness ? glsl_int16_t_type() :
> glsl_uint16_t_type());
> +         break;
> +      default:
> +         unreachable("Invalid int bit size");
> +      }
>        break;
>     }
> +
>     case SpvOpTypeFloat: {
>        int bit_size = w[2];
>        val->type->base_type = vtn_base_type_scalar;
> -      val->type->type = bit_size == 64 ? glsl_double_type() :
> glsl_float_type();
> +      switch (bit_size) {
> +      case 16:
> +         val->type->type = glsl_float16_t_type();
> +         break;
> +      case 32:
> +         val->type->type = glsl_float_type();
> +         break;
> +      case 64:
> +         val->type->type = glsl_double_type();
> +         break;
> +      default:
> +         assert(!"Invalid float bit size");
> 
> 
> unreachable()

Fixed locally.

> +      }
>        break;
>     }
> 
> @@ -980,10 +1005,13 @@ vtn_null_constant(struct vtn_builder *b,
> const struct glsl_type *type)
>     switch (glsl_get_base_type(type)) {
>     case GLSL_TYPE_INT:
>     case GLSL_TYPE_UINT:
> +   case GLSL_TYPE_INT16:
> +   case GLSL_TYPE_UINT16:
>     case GLSL_TYPE_INT64:
>     case GLSL_TYPE_UINT64:
>     case GLSL_TYPE_BOOL:
>     case GLSL_TYPE_FLOAT:
> +   case GLSL_TYPE_FLOAT16:
>     case GLSL_TYPE_DOUBLE:
>        /* Nothing to do here.  It's already initialized to zero */
>        break;
> @@ -1106,12 +1134,20 @@ vtn_handle_constant(struct vtn_builder *b,
> SpvOp opcode,
>     case SpvOpConstant: {
>        assert(glsl_type_is_scalar(val->const_type));
>        int bit_size = glsl_get_bit_size(val->const_type);
> -      if (bit_size == 64) {
> +      switch (bit_size) {
> +      case 64: {
>           val->constant->values->u32[0] = w[3];
>           val->constant->values->u32[1] = w[4];
> 
> 
> A bit unrelated 

Re: [Mesa-dev] [PATCH v4 28/44] i965/fs: Use untyped_surface_read for 16-bit load_ssbo

2017-12-01 Thread Chema Casanova
On 01/12/17 11:49, Jason Ekstrand wrote:
> On Wed, Nov 29, 2017 at 6:57 PM, Jose Maria Casanova Crespo
> <jmcasan...@igalia.com <mailto:jmcasan...@igalia.com>> wrote:
> 
> SSBO loads were using byte_scattered read messages as they allow
> reading 16-bit size components. byte_scattered messages can only
> operate one component at a time so we needed to emit as many messages
> as components.
> 
> But for vec2 and vec4 of 16-bit, being multiple of 32-bit we can use the
> untyped_surface_read message to read pairs of 16-bit components
> using only
> one message. Once each pair is read it is unshuffled to return the
> proper
> 16-bit components.
> 
> On 16-bit scalar and vec3 16-bit the not paired component is read using
> only one byte_scattered_read message.
> 
> 
> My gut tells me that, for vec3's, we'd be better off with a single
> untyped read than one untyped read and one byte scattered read.  Also,
> are there alignment issues with untyped surface reads/writes that might
> cause us problems on vec3's?  I don't know what the alignment rules are
> for 16-bit vec3's in Vulkan.

I think that untyped_read will work perfectly fine with vec3, as there are
no special rules for 16-bit types. The only issue would be that we would
always write the unused 4th component, so we decided to play safe and only
modify what was expected; only the byte scattered write allows that. The
relevant alignment rule is:

"* A three- or four-component vector, with components of size N, has a
base alignment of 4 N."

For this v4 of the series I tried to use untyped_surface_read for all the
cases, but I focused on the scalar ones, without success. For vec3 it should
be easy to do if we can assume that it is fine to write random data to
the 4th component.
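
As a worked example of the alignment rule quoted above, the following sketch
computes the base alignment of a vector and the padding left after its last
component. The function names are hypothetical; the scalar (N) and
two-component (2N) cases come from the same std140/std430 layout rules as the
quoted sentence.

```c
#include <assert.h>

/* Hypothetical helper: base alignment in bytes of a vector with the
 * given number of components, each of component_size bytes.
 */
static unsigned
vector_base_alignment(unsigned components, unsigned component_size)
{
   assert(components >= 1 && components <= 4);

   if (components == 1)
      return component_size;
   if (components == 2)
      return 2 * component_size;
   return 4 * component_size; /* vec3 and vec4 */
}

/* Hypothetical helper: bytes between the end of the vector data and the
 * end of its aligned footprint.
 */
static unsigned
vector_padding_bytes(unsigned components, unsigned component_size)
{
   return vector_base_alignment(components, component_size) -
          components * component_size;
}
```

For a 16-bit vec3 this gives an alignment of 8 bytes with 2 bytes of padding,
which is exactly the slot of the unused 4th component discussed above.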

>  
> 
> v2: Removed use of stride = 2 on sources (Jason Ekstrand)
>     Rework optimization using unshuffle 16 reads (Chema Casanova)
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 43
> ++-
>  1 file changed, 33 insertions(+), 10 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index fa7aa9c247..57e79853ef 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -2354,16 +2354,39 @@ do_untyped_vector_read(const fs_builder ,
>           bld.ADD(read_offset, read_offset, brw_imm_ud(16));
>        }
>     } else if (type_sz(dest.type) == 2) {
> -      fs_reg read_offset = bld.vgrf(BRW_REGISTER_TYPE_UD);
> -      bld.MOV(read_offset, offset_reg);
> -      for (unsigned i = 0; i < num_components; i++) {
> -         fs_reg read_reg = emit_byte_scattered_read(bld,
> surf_index, read_offset,
> -                                                    1 /* dims */,
> -                                                    1,
> -                                                    16 /*bit_size */,
> -                                                   
> BRW_PREDICATE_NONE);
> -         bld.MOV(offset(dest,bld,i), subscript(read_reg, dest.type,
> 0));
> -         bld.ADD(read_offset, read_offset,
> brw_imm_ud(type_sz(dest.type)));
> +      assert(dest.stride == 1);
> +
> +      int component_pairs = num_components / 2;
> +      /* Pairs of 16-bit components can be read with untyped read */
> +      if (component_pairs > 0) {
> +         fs_reg read_result = emit_untyped_read(bld, surf_index,
> +                                                offset_reg,
> +                                                1 /* dims */,
> +                                                component_pairs,
> +                                                BRW_PREDICATE_NONE);
> +         shuffle_32bit_load_result_to_16bit_data(bld,
> +               retype(dest, BRW_REGISTER_TYPE_HF),
> +               retype(read_result, BRW_REGISTER_TYPE_F),
> 
> 
> I'd rather we use W and D rather than HF and F.  Rounding errors scare me.

Ok.

Thanks for the review.

Chema
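
To illustrate the split described in the commit message, here is a CPU-side
sketch of the load: pairs of 16-bit components are fetched as one 32-bit dword
and then split, and an odd last component is fetched on its own. It only uses
integer types, so bit patterns are copied verbatim, which is the point of
preferring W/D over HF/F. The function name is hypothetical, the buffer stands
in for the SSBO, and a little-endian host is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical CPU model: read num_components 16-bit values starting at
 * byte offset `offset` of `buffer` into dest. Assumes little-endian.
 */
static void
load_16bit_components(const uint8_t *buffer, unsigned offset,
                      unsigned num_components, uint16_t *dest)
{
   assert(num_components >= 1 && num_components <= 4);
   const unsigned component_pairs = num_components / 2;

   /* Each pair is read as one 32-bit dword and then unshuffled. */
   for (unsigned i = 0; i < component_pairs; i++) {
      uint32_t dw;
      memcpy(&dw, buffer + offset + 4 * i, sizeof(dw));
      dest[2 * i] = (uint16_t)(dw & 0xffff);
      dest[2 * i + 1] = (uint16_t)(dw >> 16);
   }

   /* Scalars and the last component of a vec3 are read on their own. */
   if (num_components % 2) {
      const unsigned last = num_components - 1;
      memcpy(&dest[last], buffer + offset + last * sizeof(uint16_t),
             sizeof(uint16_t));
   }
}
```

In this model a vec3 costs one 32-bit read plus one 16-bit read, and nothing
beyond the third component is touched.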

> +               component_pairs * 2);
> +      }
> +      /* Last component of vec3 and scalar 16-bit read needs to be read
> +       * using one byte_scattered_read message
> +       */
> +      if (num_components % 2) {
> +         fs_reg read_offset = bld.vgrf(BRW_REGISTER_TYPE_UD);
> +         bld.ADD(read_offset,
> +                 offset_reg,
> +                 brw_imm_ud((num_components - 1) *
> type_sz(dest.type)));
> +         fs_reg read_result = emit_byte_scattered_read(bld, surf_index,
> +        

Re: [Mesa-dev] [PATCH v4 19/44] i965/fs: Use byte_scattered_write on 16-bit store_ssbo

2017-12-01 Thread Chema Casanova
On 01/12/17 11:12, Jason Ekstrand wrote:
> I've left some comments below that I think clean things up and make this
> better, but I believe it is correct as-is.
> 
> Reviewed-by: Jason Ekstrand
> 
> On Wed, Nov 29, 2017 at 6:42 PM, Jose Maria Casanova Crespo wrote:
> 
> From: Alejandro Piñeiro
> 
> We need to rely on byte scattered writes as untyped writes are 32-bit
> size. We could try to keep using 32-bit messages when we have two or
> four 16-bit elements, but for simplicity's sake, we use the same message
> for any component number. We revisit this approach in the following
> patches.
> 
> v2: Removed use of stride = 2 on 16-bit sources (Jason Ekstrand)
> 
> v3: (Jason Ekstrand)
>     - Include bit_size to scattered write message and remove namespace
>     - specific for scattered messages.
>     - Move comment to proper place.
>     - Squashed with i965/fs: Adjust type_size/type_slots on store_ssbo.
>     (Jose Maria Casanova)
>     - Take into account that get_nir_src returns now WORD types for
>       16-bit sources instead of DWORD.
> 
> Signed-off-by: Jose Maria Casanova Crespo
> Signed-off-by: Alejandro Piñeiro
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 51
> ---
>  1 file changed, 37 insertions(+), 14 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index d6ab286147..ff04e2468b 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -4075,14 +4075,15 @@ fs_visitor::nir_emit_intrinsic(const
> fs_builder , nir_intrinsic_instr *instr
>         * Also, we have to suffle 64-bit data to be in the
> appropriate layout
>         * expected by our 32-bit write messages.
>         */
> -      unsigned type_size = 4;
> -      if (nir_src_bit_size(instr->src[0]) == 64) {
> -         type_size = 8;
> +      unsigned bit_size = nir_src_bit_size(instr->src[0]);
> +      unsigned type_size = bit_size / 8;
> +      if (bit_size == 64) {
>           val_reg = shuffle_64bit_data_for_32bit_write(bld,
>              val_reg, instr->num_components);
>        }
> 
> -      unsigned type_slots = type_size / 4;
> +      /* 16-bit types would use a minimum of 1 slot */
> +      unsigned type_slots = MAX2(type_size / 4, 1);
> 
> 
> Given that this is only used for emit_typed_write, maybe we should just
> move it next to the emit_typed_write call and just get rid of the
> MAX2().  More on that later.

It makes sense. I partially follow this approach in "[PATCH v4 26/44]
i965/fs: Optimize 16-bit SSBO stores by packing two into a 32-bit reg",
using a slots_per_component that is just 2 for 64-bit and 1 for the
other bit sizes. But I like your approach.

>        /* Combine groups of consecutive enabled channels in one write
>         * message. We use ffs to find the first enabled channel and
> then ffs on
> @@ -4093,12 +4094,19 @@ fs_visitor::nir_emit_intrinsic(const
> fs_builder , nir_intrinsic_instr *instr
>           unsigned first_component = ffs(writemask) - 1;
>           unsigned length = ffs(~(writemask >> first_component)) - 1;
> 
> 
> If the one above is first_component, num_components would be a better
> name for this one.  It's very confusing go have something generically
> named "length" in a piece of code with so many different possible units.

It was also confusing to me. What about renaming it to
num_consecutive_components, as that is what it is really calculating? That
way we don't confuse it with the num_components of instr.
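
For reference, the grouping that this loop performs can be sketched outside
the driver. This is only an illustration, assuming a 4-channel writemask:
find_first_set, struct component_run and split_writemask are hypothetical
names, not Mesa functions, and max_per_write stands in for the per-type
limits discussed in this patch (2 components for 64-bit, 1 for 16-bit
byte scattered writes).

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for ffs(): 1-based index of the lowest set bit,
 * or 0 when no bit is set. Hypothetical helper for this sketch.
 */
static int
find_first_set(uint32_t v)
{
   int pos = 1;

   if (v == 0)
      return 0;
   while ((v & 1u) == 0) {
      v >>= 1;
      pos++;
   }
   return pos;
}

/* One write message: a run of consecutive enabled channels. */
struct component_run {
   unsigned first_component;
   unsigned num_consecutive_components;
};

/* Splits a 4-bit writemask into runs of consecutive enabled channels,
 * capping each run at max_per_write components. Returns the number of
 * runs stored in runs[] (at most 4).
 */
static unsigned
split_writemask(unsigned writemask, unsigned max_per_write,
                struct component_run runs[4])
{
   unsigned count = 0;

   assert(max_per_write >= 1);
   writemask &= 0xf;

   while (writemask) {
      /* First enabled channel. */
      unsigned first = (unsigned)find_first_set(writemask) - 1;
      /* First disabled channel after it gives the run length. */
      unsigned n = (unsigned)find_first_set(~(writemask >> first)) - 1;

      if (n > max_per_write)
         n = max_per_write;

      runs[count].first_component = first;
      runs[count].num_consecutive_components = n;
      count++;

      /* Clear the channels handled by this run; the next iteration
       * picks up whatever is left.
       */
      writemask &= ~(((1u << n) - 1) << first);
   }
   return count;
}
```

With a writemask of 0xB (x, y and w enabled) and no effective limit, this
yields one run of two components starting at 0 and one run of a single
component starting at 3, which is why a name like
num_consecutive_components describes the value better than "length".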

> -         /* We can't write more than 2 64-bit components at once.
> Limit the
> -          * length of the write to what we can do and let the next
> iteration
> -          * handle the rest
> -          */
> -         if (type_size > 4)
> +         if (type_size > 4) {
> +            /* We can't write more than 2 64-bit components at
> once. Limit
> +             * the length of the write to what we can do and let
> the next
> +             * iteration handle the rest.
> +             */
>              length = MIN2(2, length);
> +         } else if (type_size == 2) {
> 
> 
> Maybe type_size < 4?

I should have brought this change forward into this patch; you already
commented on it for the current [PATCH v4 26/44].


> +            /* For 16-bit types we are using byte scattered writes,
> that can
> +             * only write one component per call. So we limit the
> length, and
> +             * let the write 

Re: [Mesa-dev] [PATCH v4 22/44] i965/fs: Helpers for un/shuffle 16-bit pairs in 32-bit components

2017-11-30 Thread Chema Casanova


On 30/11/17 23:21, Jason Ekstrand wrote:
> On Wed, Nov 29, 2017 at 6:50 PM, Jose Maria Casanova Crespo
> > wrote:
> 
> This helpers are used to load/store 16-bit types from/to 32-bit
> components.
> 
> The functions shuffle_32bit_load_result_to_16bit_data and
> shuffle_16bit_data_for_32bit_write are implemented in a similar
> way than the analogous functions for handling 64-bit types.
> ---
>  src/intel/compiler/brw_fs.h       | 11 +
>  src/intel/compiler/brw_fs_nir.cpp | 51
> +++
>  2 files changed, 62 insertions(+)
> 
> diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h
> index 19b897e7a9..30557324d5 100644
> --- a/src/intel/compiler/brw_fs.h
> +++ b/src/intel/compiler/brw_fs.h
> @@ -497,6 +497,17 @@ void
> shuffle_32bit_load_result_to_64bit_data(const brw::fs_builder ,
>  fs_reg shuffle_64bit_data_for_32bit_write(const brw::fs_builder ,
>                                            const fs_reg ,
>                                            uint32_t components);
> +
> +void shuffle_32bit_load_result_to_16bit_data(const brw::fs_builder
> ,
> +                                             const fs_reg ,
> +                                             const fs_reg ,
> +                                             uint32_t components);
> +
> +void shuffle_16bit_data_for_32bit_write(const brw::fs_builder ,
> +                                        const fs_reg ,
> +                                        const fs_reg ,
> +                                        uint32_t components);
> +
>  fs_reg setup_imm_df(const brw::fs_builder ,
>                      double v);
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 726b2fcee7..c091241132 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -4828,6 +4828,33 @@ shuffle_32bit_load_result_to_64bit_data(const
> fs_builder ,
>     }
>  }
> 
> +void
> +shuffle_32bit_load_result_to_16bit_data(const fs_builder ,
> +                                        const fs_reg ,
> +                                        const fs_reg ,
> +                                        uint32_t components)
> +{
> +   assert(type_sz(src.type) == 4);
> +   assert(type_sz(dst.type) == 2);
> +
> +   fs_reg tmp = retype(bld.vgrf(src.type), dst.type);
> +
> +   for (unsigned i = 0; i < components; i++) {
> +      const fs_reg component_i = subscript(offset(src, bld, i / 2),
> dst.type, i % 2);
> +
> +      bld.MOV(offset(tmp, bld, i % 2), component_i);
> +
> +      if (i % 2) {
> +         bld.MOV(offset(dst, bld, i -1), offset(tmp, bld, 0));
> +         bld.MOV(offset(dst, bld, i), offset(tmp, bld, 1));
> +      }
> 
> 
> I'm very confused by this extra moving.  Why can't we just do
> 
> bld.MOV(offset(dst, bld, i), component_i);
> 
> above?  Maybe I'm missing something but I don't see why the extra moves
> are needed.


There is a comment in the previous function,
shuffle_32bit_load_result_to_64bit_data, that explains a similar
situation, and it still applies to shuffle_32bit_load_result_to_16bit_data.
What about including the following comment?

/* A temporary that we will use to shuffle the 16-bit data of each
 * component in the vector into valid 32-bit data. We can't write
 * directly to dst because dst can be the same as src and in that
 * case the first MOV in the loop below would overwrite the data
 * read in the second MOV.
 */

In any case, I've just checked and, at first sight, this situation never
happens in the 6 final uses of this function.
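
As a rough scalar model of the indexing that both helpers rely on (16-bit
component i lives in 32-bit word i / 2, half i % 2), something like the
following may help. It is only a sketch: the function names are
hypothetical, it assumes half 0 is the low 16 bits, it ignores the SIMD
register layout entirely, and it requires dst and src not to overlap,
which is exactly the case the temporary in the patch guards against.

```c
#include <assert.h>
#include <stdint.h>

/* Unpacks `components` 16-bit values from packed 32-bit words.
 * Component i is read from word i / 2, half i % 2 (half 0 = low bits).
 * dst and src must not alias in this sketch.
 */
static void
unpack_16bit_from_32bit(uint16_t *dst, const uint32_t *src,
                        unsigned components)
{
   assert(dst && src);
   for (unsigned i = 0; i < components; i++)
      dst[i] = (uint16_t)(src[i / 2] >> (16 * (i % 2)));
}

/* Inverse operation: packs `components` 16-bit values two per 32-bit
 * word. With an odd count, the high half of the last word is zero.
 */
static void
pack_16bit_into_32bit(uint32_t *dst, const uint16_t *src,
                      unsigned components)
{
   assert(dst && src);
   for (unsigned i = 0; i < components; i++) {
      if (i % 2 == 0)
         dst[i / 2] = src[i];                   /* low half, high half cleared */
      else
         dst[i / 2] |= (uint32_t)src[i] << 16;  /* high half */
   }
}
```

In this model a vec3 of 16-bit values occupies two 32-bit words, with the
third component in the low half of the second word.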

>  
> 
> +   }
> +   if (components % 2) {
> +      bld.MOV(offset(dst, bld, components - 1), tmp);
> +   }
> +}
> +
> +
>  /**
>   * This helper does the inverse operation of
>   * SHUFFLE_32BIT_LOAD_RESULT_TO_64BIT_DATA.
> @@ -4860,6 +4887,30 @@ shuffle_64bit_data_for_32bit_write(const
> fs_builder ,
>     return dst;
>  }
> 
> +void
> +shuffle_16bit_data_for_32bit_write(const fs_builder ,
> +                                   const fs_reg ,
> +                                   const fs_reg ,
> +                                   uint32_t components)
> +{
> +   assert(type_sz(src.type) == 2);
> +   assert(type_sz(dst.type) == 4);
> +
> +   fs_reg tmp = bld.vgrf(dst.type);
> +
> +   for (unsigned i = 0; i < components; i++) {
> +      const fs_reg component_i = offset(src, bld, i);
> +      bld.MOV(subscript(tmp, src.type, i % 2), component_i);
> +      if (i % 2) {
> +         bld.MOV(offset(dst, bld, i / 2), tmp);
> +      }
> 
> 
> Again, why the extra MOVs?  

Re: [Mesa-dev] i965: Kicking off fp16 glsl support

2017-11-27 Thread Chema Casanova
On 27/11/17 21:11, Matt Turner wrote:
> 1-14, except 4 are
>
> Reviewed-by: Matt Turner <matts...@gmail.com>
>
> I started getting to things that made me realize I needed to review
> Igalia's work before I continued here.

I'm submitting the v4 of our VK_KHR_16bit_storage series tomorrow, so
it would be better to have a look at the new one.

Chema Casanova


Re: [Mesa-dev] [PATCH v3 20/43] i965/fs: Add byte scattered write message and fs support

2017-11-19 Thread Chema Casanova
On 31/10/17 01:02, Jason Ekstrand wrote:
> On Thu, Oct 12, 2017 at 11:38 AM, Jose Maria Casanova Crespo
> > wrote:
> 
> Signed-off-by: Jose Maria Casanova Crespo  >
> Signed-off-by: Alejandro Piñeiro  >
> ---
>  src/intel/compiler/brw_eu.h                    |  6 ++
>  src/intel/compiler/brw_eu_defines.h            | 17 +
>  src/intel/compiler/brw_eu_emit.c               | 89
> ++
>  src/intel/compiler/brw_fs.cpp                  | 10 +++
>  src/intel/compiler/brw_fs_copy_propagation.cpp |  2 +
>  src/intel/compiler/brw_fs_generator.cpp        |  5 ++
>  src/intel/compiler/brw_fs_surface_builder.cpp  | 17 +
>  src/intel/compiler/brw_fs_surface_builder.h    |  9 +++
>  src/intel/compiler/brw_shader.cpp              |  7 ++
>  9 files changed, 162 insertions(+)
> 
> diff --git a/src/intel/compiler/brw_eu.h b/src/intel/compiler/brw_eu.h
> index 145942a54f..b44ca0f518 100644
> --- a/src/intel/compiler/brw_eu.h
> +++ b/src/intel/compiler/brw_eu.h
> @@ -476,6 +476,12 @@ brw_typed_surface_write(struct brw_codegen *p,
>                          unsigned num_channels);
> 
>  void
> +brw_byte_scattered_write(struct brw_codegen *p,
> +                         struct brw_reg payload,
> +                         struct brw_reg surface,
> +                         unsigned msg_length);
> +
> +void
>  brw_memory_fence(struct brw_codegen *p,
>                   struct brw_reg dst);
> 
> diff --git a/src/intel/compiler/brw_eu_defines.h
> b/src/intel/compiler/brw_eu_defines.h
> index 1751f18293..9aac385ba7 100644
> --- a/src/intel/compiler/brw_eu_defines.h
> +++ b/src/intel/compiler/brw_eu_defines.h
> @@ -390,6 +390,16 @@ enum opcode {
> 
>     SHADER_OPCODE_RND_MODE,
> 
> +   /**
> +    * Byte scattered write/read opcodes.
> +    *
> +    * LOGICAL opcodes are eventually translated to the matching
> non-LOGICAL
> +    * opcode, but instead of taking a single payload blog they
> expect their
> +    * arguments separately as individual sources, like untyped
> write/read.
> +    */
> +   SHADER_OPCODE_BYTE_SCATTERED_WRITE,
> +   SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL,
> +
>     SHADER_OPCODE_MEMORY_FENCE,
> 
>     SHADER_OPCODE_GEN4_SCRATCH_READ,
> @@ -1231,4 +1241,11 @@ enum PACKED brw_rnd_mode {
>     BRW_RND_MODE_UNSPECIFIED,  /* Unspecified rounding mode */
>  };
> 
> +/* MDC_DS - Data Size Message Descriptor Control Field */
> +enum PACKED brw_data_size {
> +   GEN7_BYTE_SCATTERED_DATA_SIZE_BYTE = 0,
> +   GEN7_BYTE_SCATTERED_DATA_SIZE_WORD = 1,
> +   GEN7_BYTE_SCATTERED_DATA_SIZE_DWORD = 2
> +};
> +
>  #endif /* BRW_EU_DEFINES_H */
> diff --git a/src/intel/compiler/brw_eu_emit.c
> b/src/intel/compiler/brw_eu_emit.c
> index 8c1e4c5eae..84d85be653 100644
> --- a/src/intel/compiler/brw_eu_emit.c
> +++ b/src/intel/compiler/brw_eu_emit.c
> @@ -2483,6 +2483,49 @@ brw_send_indirect_surface_message(struct
> brw_codegen *p,
>     return insn;
>  }
> 
> +
> +static struct brw_inst *
> +brw_send_indirect_scattered_message(struct brw_codegen *p,
> +                                    unsigned sfid,
> +                                    struct brw_reg dst,
> +                                    struct brw_reg payload,
> +                                    struct brw_reg surface,
> +                                    unsigned message_len,
> +                                    unsigned response_len,
> +                                    bool header_present)
> 
> 
> How is this any different from brw_send_indirect_surface_message?  They
> look identical except for the fact that this one is missing the explicit
> brw_set_default_exec_size I added to the other as part of my subgroup
> series.  If there's no real difference, let's delete this one and just
> use the other.  You can make a pretty good case that the scattered byte
> messages are "surface" messages.

There was no real difference, so I've modified them to use
brw_send_indirect_surface_message. Following the same reasoning, it
makes sense to move all the scattered messages to the surface_access
namespace.
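
One thing the byte scattered messages do carry that the untyped ones do
not is the MDC_DS data size field from the enum in the quoted patch. A
minimal sketch of how a bit size could be mapped to that field follows.
The enum values are copied from the patch, but data_size_from_bit_size is
a hypothetical helper for illustration, not a function from the series.

```c
#include <assert.h>

/* MDC_DS - Data Size Message Descriptor Control Field.
 * Values copied from the quoted patch.
 */
enum brw_data_size {
   GEN7_BYTE_SCATTERED_DATA_SIZE_BYTE = 0,
   GEN7_BYTE_SCATTERED_DATA_SIZE_WORD = 1,
   GEN7_BYTE_SCATTERED_DATA_SIZE_DWORD = 2
};

/* Hypothetical helper: picks the data size field for a given bit size.
 * Only 8, 16 and 32 are meaningful; anything else trips the assert.
 */
static enum brw_data_size
data_size_from_bit_size(unsigned bit_size)
{
   switch (bit_size) {
   case 8:
      return GEN7_BYTE_SCATTERED_DATA_SIZE_BYTE;
   case 16:
      return GEN7_BYTE_SCATTERED_DATA_SIZE_WORD;
   case 32:
      return GEN7_BYTE_SCATTERED_DATA_SIZE_DWORD;
   default:
      assert(!"Unsupported bit size for byte scattered messages");
      return GEN7_BYTE_SCATTERED_DATA_SIZE_DWORD;
   }
}
```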

> +{
> +   const struct gen_device_info *devinfo = p->devinfo;
> +   struct brw_inst *insn;
> +
> +   if (surface.file != BRW_IMMEDIATE_VALUE) {
> +      struct brw_reg addr = retype(brw_address_reg(0),
> BRW_REGISTER_TYPE_UD);
> +
> +      brw_push_insn_state(p);
> +      brw_set_default_access_mode(p, BRW_ALIGN_1);
> +      brw_set_default_mask_control(p, BRW_MASK_DISABLE);
> +      

Re: [Mesa-dev] [PATCH v3 17/43] i965/fs: Enable rounding mode on f2f16 ops

2017-11-13 Thread Chema Casanova
On 30/10/17 23:40, Jason Ekstrand wrote:
> On Thu, Oct 12, 2017 at 11:38 AM, Jose Maria Casanova Crespo
> > wrote:
> 
> From: Alejandro Piñeiro  >
> 
> By default we don't set the rounding mode. We only set
> round-to-near-even or round-to-zero mode if explicitly set from nir.
> 
> v2: Use a single SHADER_OPCODE_RND_MODE opcode taking an immediate
>     with the rounding mode (Curro)
> 
> Signed-off-by: Jose Maria Casanova Crespo  >
> Signed-off-by: Alejandro Piñeiro  >
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 6908c7ea02..b356836e80 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -693,6 +693,14 @@ fs_visitor::nir_emit_alu(const fs_builder ,
> nir_alu_instr *instr)
>        inst->saturate = instr->dest.saturate;
>        break;
> 
> +   case nir_op_f2f16_rtne:
> +   case nir_op_f2f16_rtz:
> +      if (instr->op == nir_op_f2f16_rtz)
> +         bld.emit(SHADER_OPCODE_RND_MODE, bld.null_reg_ud(),
> brw_imm_d(BRW_RND_MODE_RTZ));
> +      else if (instr->op == nir_op_f2f16_rtne)
> +         bld.emit(SHADER_OPCODE_RND_MODE, bld.null_reg_ud(),
> brw_imm_d(BRW_RND_MODE_RTNE));
> +      /* fallthrough */
> 
> 
> It might look a little nicer (though it's more lines of code) to have a
> little brw_from_nir_rounding_mode helper and then we could have just the
> one emit call.  I don't care too much though.


What about this helper?

static brw_rnd_mode
brw_rnd_mode_from_nir_op(const nir_op op)
{
   switch (op) {
   case nir_op_f2f16_rtz:
      return BRW_RND_MODE_RTZ;
   case nir_op_f2f16_rtne:
      return BRW_RND_MODE_RTNE;
   default:
      unreachable("Operation doesn't support rounding mode");
   }
}

And ...

   case nir_op_f2f16_rtne:
   case nir_op_f2f16_rtz:
      bld.emit(SHADER_OPCODE_RND_MODE, bld.null_reg_ud(),
               brw_imm_d(brw_rnd_mode_from_nir_op(instr->op)));


> 
> +
>        /* In theory, it would be better to use BRW_OPCODE_F32TO16.
> Depending
>         * on the HW gen, it is a special hw opcode or just a MOV, and
>         * brw_F32TO16 (at brw_eu_emit) would do the work to chose.
> --
> 2.13.6
> 


Re: [Mesa-dev] [PATCH v3] compiler: Mark when input/ouput attribute at VS uses 16-bit (v2)

2017-11-02 Thread Chema Casanova
On 02/11/17 19:25, Jason Ekstrand wrote:
> On Thu, Nov 2, 2017 at 11:17 AM, Chema Casanova <jmcasan...@igalia.com
> <mailto:jmcasan...@igalia.com>> wrote:
>
>
>
> On 01/11/17 22:07, Jason Ekstrand wrote:
> > On Tue, Oct 17, 2017 at 10:05 AM, Jose Maria Casanova Crespo
> > <jmcasan...@igalia.com <mailto:jmcasan...@igalia.com>
> <mailto:jmcasan...@igalia.com <mailto:jmcasan...@igalia.com>>> wrote:
> >
> >     New shader attribute to mark when a location has 16-bit
> >     value. This patch includes support on mesa glsl and nir.
> >
> >     v2: Remove use of is_half_slot as is a duplicate of is_16bit
> >         (Topi Pohjolainen)
> >     ---
> >      src/compiler/glsl_types.h          | 15 +++
> >      src/compiler/nir/nir_gather_info.c | 21 ++---
> >      src/compiler/shader_info.h         |  2 ++
> >      3 files changed, 31 insertions(+), 7 deletions(-)
> >
> >     diff --git a/src/compiler/glsl_types.h
> b/src/compiler/glsl_types.h
> >     index 32399df351..e35e8d8f88 100644
> >     --- a/src/compiler/glsl_types.h
> >     +++ b/src/compiler/glsl_types.h
> >     @@ -93,6 +93,13 @@ static inline bool
> >     glsl_base_type_is_integer(enum glsl_base_type type)
> >                type == GLSL_TYPE_IMAGE;
> >      }
> >
> >     +static inline bool glsl_base_type_is_16bit(enum
> glsl_base_type type)
> >     +{
> >     +   return type == GLSL_TYPE_FLOAT16 ||
> >     +          type == GLSL_TYPE_UINT16 ||
> >     +          type == GLSL_TYPE_INT16;
> >     +}
> >     +
> >      enum glsl_sampler_dim {
> >         GLSL_SAMPLER_DIM_1D = 0,
> >         GLSL_SAMPLER_DIM_2D,
> >     @@ -555,6 +562,14 @@ struct glsl_type {
> >         }
> >
> >         /**
> >     +    * Query whether or not a type is 16-bit
> >     +    */
> >     +   bool is_16bit() const
> >     +   {
> >     +      return glsl_base_type_is_16bit(base_type);
> >     +   }
> >     +
> >     +   /**
> >          * Query whether or not a type is a non-array boolean type
> >          */
> >         bool is_boolean() const
> >     diff --git a/src/compiler/nir/nir_gather_info.c
> >     b/src/compiler/nir/nir_gather_info.c
> >     index ac87bec46c..cce64f9c84 100644
> >     --- a/src/compiler/nir/nir_gather_info.c
> >     +++ b/src/compiler/nir/nir_gather_info.c
> >     @@ -212,14 +212,20 @@ gather_intrinsic_info(nir_intrinsic_instr
> >     *instr, nir_shader *shader)
> >               if (!try_mask_partial_io(shader, instr->variables[0]))
> >                  mark_whole_variable(shader, var);
> >
> >     -         /* We need to track which input_reads bits
> correspond to a
> >     -          * dvec3/dvec4 input attribute */
> >     +         /* We need to track which input_reads bits
> correspond to
> >     +          * dvec3/dvec4 or 16-bit  input attributes */
> >               if (shader->stage == MESA_SHADER_VERTEX &&
> >     -             var->data.mode == nir_var_shader_in &&
> >     -           
>  glsl_type_is_dual_slot(glsl_without_array(var->type))) {
> >     -            for (uint i = 0; i <
> >     glsl_count_attribute_slots(var->type, false); i++) {
> >     -               int idx = var->data.location + i;
> >     -               shader->info.double_inputs_read |=
> >     BITFIELD64_BIT(idx);
> >     +             var->data.mode == nir_var_shader_in) {
> >     +            if
> >     (glsl_type_is_dual_slot(glsl_without_array(var->type))) {
> >     +               for (uint i = 0; i <
> >     glsl_count_attribute_slots(var->type, false); i++) {
> >     +                  int idx = var->data.location + i;
> >     +                  shader->info.double_inputs_read |=
> >     BITFIELD64_BIT(idx);
> >     +               }
> >     +            } else if
> >     (glsl_get_bit_size(glsl_without_array(var->type)) == 16) {
> >     +               for (uint i = 0; i <
> >     glsl_count_attribute_slots(var->type, false); i++) {
> >     +                  int idx = var->data.location + i;
>

Re: [Mesa-dev] [PATCH v3] compiler: Mark when input/ouput attribute at VS uses 16-bit (v2)

2017-11-02 Thread Chema Casanova


On 01/11/17 22:07, Jason Ekstrand wrote:
> On Tue, Oct 17, 2017 at 10:05 AM, Jose Maria Casanova Crespo
> > wrote:
>
> New shader attribute to mark when a location has 16-bit
> value. This patch includes support on mesa glsl and nir.
>
> v2: Remove use of is_half_slot as is a duplicate of is_16bit
>     (Topi Pohjolainen)
> ---
>  src/compiler/glsl_types.h          | 15 +++
>  src/compiler/nir/nir_gather_info.c | 21 ++---
>  src/compiler/shader_info.h         |  2 ++
>  3 files changed, 31 insertions(+), 7 deletions(-)
>
> diff --git a/src/compiler/glsl_types.h b/src/compiler/glsl_types.h
> index 32399df351..e35e8d8f88 100644
> --- a/src/compiler/glsl_types.h
> +++ b/src/compiler/glsl_types.h
> @@ -93,6 +93,13 @@ static inline bool
> glsl_base_type_is_integer(enum glsl_base_type type)
>            type == GLSL_TYPE_IMAGE;
>  }
>
> +static inline bool glsl_base_type_is_16bit(enum glsl_base_type type)
> +{
> +   return type == GLSL_TYPE_FLOAT16 ||
> +          type == GLSL_TYPE_UINT16 ||
> +          type == GLSL_TYPE_INT16;
> +}
> +
>  enum glsl_sampler_dim {
>     GLSL_SAMPLER_DIM_1D = 0,
>     GLSL_SAMPLER_DIM_2D,
> @@ -555,6 +562,14 @@ struct glsl_type {
>     }
>
>     /**
> +    * Query whether or not a type is 16-bit
> +    */
> +   bool is_16bit() const
> +   {
> +      return glsl_base_type_is_16bit(base_type);
> +   }
> +
> +   /**
>      * Query whether or not a type is a non-array boolean type
>      */
>     bool is_boolean() const
> diff --git a/src/compiler/nir/nir_gather_info.c
> b/src/compiler/nir/nir_gather_info.c
> index ac87bec46c..cce64f9c84 100644
> --- a/src/compiler/nir/nir_gather_info.c
> +++ b/src/compiler/nir/nir_gather_info.c
> @@ -212,14 +212,20 @@ gather_intrinsic_info(nir_intrinsic_instr
> *instr, nir_shader *shader)
>           if (!try_mask_partial_io(shader, instr->variables[0]))
>              mark_whole_variable(shader, var);
>
> -         /* We need to track which input_reads bits correspond to a
> -          * dvec3/dvec4 input attribute */
> +         /* We need to track which input_reads bits correspond to
> +          * dvec3/dvec4 or 16-bit  input attributes */
>           if (shader->stage == MESA_SHADER_VERTEX &&
> -             var->data.mode == nir_var_shader_in &&
> -             glsl_type_is_dual_slot(glsl_without_array(var->type))) {
> -            for (uint i = 0; i <
> glsl_count_attribute_slots(var->type, false); i++) {
> -               int idx = var->data.location + i;
> -               shader->info.double_inputs_read |=
> BITFIELD64_BIT(idx);
> +             var->data.mode == nir_var_shader_in) {
> +            if
> (glsl_type_is_dual_slot(glsl_without_array(var->type))) {
> +               for (uint i = 0; i <
> glsl_count_attribute_slots(var->type, false); i++) {
> +                  int idx = var->data.location + i;
> +                  shader->info.double_inputs_read |=
> BITFIELD64_BIT(idx);
> +               }
> +            } else if
> (glsl_get_bit_size(glsl_without_array(var->type)) == 16) {
> +               for (uint i = 0; i <
> glsl_count_attribute_slots(var->type, false); i++) {
> +                  int idx = var->data.location + i;
> +                  shader->info.half_inputs_read |=
> BITFIELD64_BIT(idx);
> +               }
>              }
>           }
>        }
> @@ -312,6 +318,7 @@ nir_shader_gather_info(nir_shader *shader,
> nir_function_impl *entrypoint)
>     shader->info.outputs_written = 0;
>     shader->info.outputs_read = 0;
>     shader->info.double_inputs_read = 0;
> +   shader->info.half_inputs_read = 0;
>     shader->info.patch_inputs_read = 0;
>     shader->info.patch_outputs_written = 0;
>     shader->info.system_values_read = 0;
> diff --git a/src/compiler/shader_info.h b/src/compiler/shader_info.h
> index 38413940d6..98111fa1e0 100644
> --- a/src/compiler/shader_info.h
> +++ b/src/compiler/shader_info.h
> @@ -55,6 +55,8 @@ typedef struct shader_info {
>     uint64_t inputs_read;
>     /* Which inputs are actually read and are double */
>     uint64_t double_inputs_read;
> +   /* Which inputs are actually read and are half */
> +   uint64_t half_inputs_read;
>
>
> Given that we're flagging this for 16-bit integers, I don't think
> "half" is really appropriate.  How about just 16bit_inputs_read?

I thought about that, but we cannot do that, because C requires
identifiers to start with a letter or an underscore. As the logic was
the same as for double, I didn't want to go for an inputs_read_16bits. I
didn't come up with a better 

Re: [Mesa-dev] [PATCH v3 00/43] anv: SPV_KHR_16bit_storage/VK_KHR_16bit_storage for gen8+

2017-11-02 Thread Chema Casanova
On 02/11/17 01:43, Jason Ekstrand wrote:
> I'm done reading for the day.  As you're working on incorporating
> feedback, I'd  like you to re-arrange things a bit so that we do
> everything required to enable VK_KHR_16bit_storage (including
> advertising the Vulkan extension string) for SSBOs and UBOs first and
> then enable it for push constants and enable it for inputs/outputs
> last.  This way we can land the most important part (UBOs and SSBOs)
> soon and the more annoying parts can get the review time that they need.

I think that is a good approach. I'll reorder the series so we can land
and enable the UBOs/SSBOs without the other capabilities.

Chema

>
> On Mon, Oct 30, 2017 at 5:20 PM, Jason Ekstrand  > wrote:
>
> Patches 1-5, 8-11, and 13-18 are
>
> Reviewed-by: Jason Ekstrand  >
>
> On Mon, Oct 16, 2017 at 8:23 AM, Pohjolainen, Topi
> >
> wrote:
>
> On Mon, Oct 16, 2017 at 08:03:41AM -0700, Jason Ekstrand wrote:
> > FYI: I'm planning to review this some time this week. 
> Probably not today
> > though.
>
> Great, I was hoping you would. I'm just reading out of
> curiosity and asking
> random questions. Mostly trying to remind myself how compiler
> works :) It has
> been a while since I had anything to do with it.
>
> >
> > On Thu, Oct 12, 2017 at 11:37 AM, Jose Maria Casanova Crespo <
> > jmcasan...@igalia.com > wrote:
> >
> > > Hello,
> > >
> > > this is the V3 series for the implementation of the
> > > SPV_KHR_16bit_storage and VK_KHR_16bit_storage extensions
> on the anv
> > > vulkan driver, in addition to the GLSL and NIR support needed.
> > >
> > > The original series can be found here [1], and the V2 is
> available
> > > here [2].
> > >
> > > In short V3 includes the following:
> > >
> > >  * Updates on several patches after the review of the V2
> series.
> > >    This includes some squashes, and specially changes so
> 16-bit
> > >    types are always packed, not using stride 2 by default.
> > >    This implied a re-implementation of all
> load_input/store_output
> > >    intrinsics for 16-bit. New solution shuffles and unshuffles
> > >    16-bit components in 32-bit URB write and read
> operations. This
> > >    saves space in the URB writes and reduces the register
> pressure
> > >    just using half of the space.
> > >
> > > * 5 patches have been removed from v2 series because now
> we not
> > >    assume the stride 2 for 16-bit registers. We also
> removed the
> > >    patch of reuse_16bit_conversion_register. The problems
> related
> > >    to spilling that motivate that patch were better
> addressed by
> > >    Curro's liveness patch.
> > >
> > >    i965/fs: Set stride 2 when dealing with 16-bit floats/ints
> > >    i965/fs: Retype 16-bit/stride2 movs to UD on nir_op_vecX
> > >    i965/fs: Need to allocate as minimum 32-bit register
> > >    i965/fs: Update assertion on copy propagation
> > >    i965/fs: Add reuse_16bit_conversions_register optimization
> > >
> > > Finally an updated overview of the patches:
> > >
> > > Patches 1-2 add 16-bit float, int and uint types to GLSL.
> This is
> > > needed because NIR uses GLSL types internally. We use the
> enums
> > > already defined at AMD_gpu_shader_half_float and NV_gpu_shader
> > > extensions. Patch 4 updates mesa/st, in order to avoid
> warnings for
> > > types not handled on a switch.
> > >
> > > Patches 3-6 add NIR support for those new GLSL 16-bit types,
> > > conversion opcodes, and rounding modes for float to half-float
> > > conversions.
> > >
> > > Patches 7-9 add the SPIR-V (SPV_KHR_16bit_storage) to NIR
> support.
> > >
> > > Patches 10-13 add general 16-bit support for i965. This
> includes
> > > handling of new types on several general purpose methods,
> > > update/remove some asserts.
> > >
> > > Patches 14-18 add support for 32 to 16-bit conversions for
> i965,
> > > including rounding mode opcodes (needed for float to
> half-float
> > > conversions), and an optimization that removes superfluous
> rounding
> > > mode sets.
> > >
> > > Patch 19 adds 16-bit support for constant location.
> > >
>  

Re: [Mesa-dev] [PATCH v3 17/43] i965/fs: Enable rounding mode on f2f16 ops

2017-11-02 Thread Chema Casanova


On 30/10/17 23:40, Jason Ekstrand wrote:
> On Thu, Oct 12, 2017 at 11:38 AM, Jose Maria Casanova Crespo
> > wrote:
>
> From: Alejandro Piñeiro  >
>
> By default we don't set the rounding mode. We only set
> round-to-near-even or round-to-zero mode if explicitly set from nir.
>
> v2: Use a single SHADER_OPCODE_RND_MODE opcode taking an immediate
>     with the rounding mode (Curro)
>
> Signed-off-by: Jose Maria Casanova Crespo  >
> Signed-off-by: Alejandro Piñeiro  >
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 6908c7ea02..b356836e80 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -693,6 +693,14 @@ fs_visitor::nir_emit_alu(const fs_builder
> , nir_alu_instr *instr)
>        inst->saturate = instr->dest.saturate;
>        break;
>
> +   case nir_op_f2f16_rtne:
> +   case nir_op_f2f16_rtz:
> +      if (instr->op == nir_op_f2f16_rtz)
> +         bld.emit(SHADER_OPCODE_RND_MODE, bld.null_reg_ud(),
> brw_imm_d(BRW_RND_MODE_RTZ));
> +      else if (instr->op == nir_op_f2f16_rtne)
> +         bld.emit(SHADER_OPCODE_RND_MODE, bld.null_reg_ud(),
> brw_imm_d(BRW_RND_MODE_RTNE));
> +      /* fallthrough */
>
>
> It might look a little nicer (though it's more lines of code) to have
> a little brw_from_nir_rounding_mode helper and then we could have just
> the one emit call.  I don't care too much though.

I agree, and it would simplify things if we want to enable other rounding
modes supported by the hardware in the future.

>  
>
> +
>        /* In theory, it would be better to use BRW_OPCODE_F32TO16.
> Depending
>         * on the HW gen, it is a special hw opcode or just a MOV, and
>         * brw_F32TO16 (at brw_eu_emit) would do the work to chose.
> --
> 2.13.6
>


Re: [Mesa-dev] [PATCH v3 12/43] i965/fs: Add brw_reg_type_from_bit_size utility method

2017-11-02 Thread Chema Casanova


On 30/10/17 23:15, Jason Ekstrand wrote:
>
> On Mon, Oct 30, 2017 at 3:08 PM, Jason Ekstrand  > wrote:
>
> On Thu, Oct 12, 2017 at 11:38 AM, Jose Maria Casanova Crespo
> > wrote:
>
> From: Alejandro Piñeiro  >
>
> Returns the brw_type for a given ssa.bit_size, and a reference
> type.
> So if bit_size is 64, and the reference type is
> BRW_REGISTER_TYPE_F,
> it returns BRW_REGISTER_TYPE_DF. The same applies if bit_size
> is 32
> and reference type is BRW_REGISTER_TYPE_HF it returns
> BRW_REGISTER_TYPE_F
>
> v2 (Jason Ekstrand):
>  - Use better unreachable() messages
>  - Add Q types
>
> Signed-off-by: Jose Maria Casanova Crespo
> >
> Signed-off-by: Alejandro Piñeiro  
> Reviewed-by: Jason Ekstrand  >
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 69
> ---
>  1 file changed, 64 insertions(+), 5 deletions(-)
>
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 7ed44f534c..affe65d5e9 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -227,6 +227,65 @@ fs_visitor::nir_emit_system_values()
>     }
>  }
>
> +/*
> + * Returns a type based on a reference_type (word, float,
> half-float) and a
> + * given bit_size.
> + *
> + * Reference BRW_REGISTER_TYPE are HF,F,DF,W,D,UW,UD.
> + *
> + * @FIXME: 64-bit return types are always DF on integer types
> to maintain
> + * compability with uses of DF previously to the introduction
> of int64
> + * support.
>
>
> I just read this comment and I really don't like it.  This is going to
> come back to bite us if we don't fix it some better way.  How many
> places do we actually need to override to DF?  I suppose we'll need it
> for intrinsics and a couple of ALU operations such as bcsel.  I'd like
> to keep it as contained as we can.

We also had doubts about this behavior, which is why we added the
@FIXME. We created this function because a similar switch was used in 4
places when adding 16bit_storage support on top of the existing 64-bit
support. You can check the uses at the end of the patch. Two of them
were originally BRW_REGISTER_TYPE_D : BRW_REGISTER_TYPE_DF; the others
were, as expected, BRW_REGISTER_TYPE_F : BRW_REGISTER_TYPE_DF.

As Q types weren't used in these cases, we returned DF to avoid a
conditional retype and keep the original code unchanged.

In any case, I will check whether changing the return types introduces
any regressions in these cases.
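
For reference, the variant requested in the review (Q/UQ for 64-bit
integer reference types instead of DF) would look roughly like the sketch
below. The enum is a stand-in for brw_reg_type that covers only the types
discussed here, and the function name is illustrative.

```cpp
#include <cassert>

// Stand-in for the register types mentioned in this thread.
enum class RegType { HF, F, DF, W, D, Q, UW, UD, UQ };

// Pick the type of the same class as `reference` with the given bit size.
// 64-bit integer references yield Q/UQ here rather than DF.
static RegType
reg_type_from_bit_size(unsigned bit_size, RegType reference)
{
   switch (reference) {
   case RegType::HF: case RegType::F: case RegType::DF:
      switch (bit_size) {
      case 16: return RegType::HF;
      case 32: return RegType::F;
      case 64: return RegType::DF;
      }
      break;
   case RegType::W: case RegType::D: case RegType::Q:
      switch (bit_size) {
      case 16: return RegType::W;
      case 32: return RegType::D;
      case 64: return RegType::Q;
      }
      break;
   case RegType::UW: case RegType::UD: case RegType::UQ:
      switch (bit_size) {
      case 16: return RegType::UW;
      case 32: return RegType::UD;
      case 64: return RegType::UQ;
      }
      break;
   }
   assert(!"Invalid bit size");
   return reference;
}
```

With this shape, callers that still need DF for a 64-bit integer source
would have to retype explicitly, which keeps the override visible at the
call site.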

Thanks for the review.

Chema

>
> --Jason
>  
>
> + */
> +static brw_reg_type
> +brw_reg_type_from_bit_size(const unsigned bit_size,
> +                           const brw_reg_type reference_type)
> +{
> +   switch(reference_type) {
> +   case BRW_REGISTER_TYPE_HF:
> +   case BRW_REGISTER_TYPE_F:
> +   case BRW_REGISTER_TYPE_DF:
> +      switch(bit_size) {
> +      case 16:
> +         return BRW_REGISTER_TYPE_HF;
> +      case 32:
> +         return BRW_REGISTER_TYPE_F;
> +      case 64:
> +         return BRW_REGISTER_TYPE_DF;
> +      default:
> +         unreachable("Invalid bit size");
> +      }
> +   case BRW_REGISTER_TYPE_W:
> +   case BRW_REGISTER_TYPE_D:
> +   case BRW_REGISTER_TYPE_Q:
> +      switch(bit_size) {
> +      case 16:
> +         return BRW_REGISTER_TYPE_W;
> +      case 32:
> +         return BRW_REGISTER_TYPE_D;
> +      case 64:
> +         return BRW_REGISTER_TYPE_DF;
>
>
> This should be BRW_REGISTER_TYPE_Q
>  
>
> +      default:
> +         unreachable("Invalid bit size");
> +      }
> +   case BRW_REGISTER_TYPE_UW:
> +   case BRW_REGISTER_TYPE_UD:
> +   case BRW_REGISTER_TYPE_UQ:
> +      switch(bit_size) {
> +      case 16:
> +         return BRW_REGISTER_TYPE_UW;
> +      case 32:
> +         return BRW_REGISTER_TYPE_UD;
> +      case 64:
> +         return BRW_REGISTER_TYPE_DF;
>
>
> This should be BRW_REGISTER_TYPE_UQ
>
> With those fixed,
>
> Reviewed-by: Jason Ekstrand  >
>  
>

Re: [Mesa-dev] [PATCH v3 06/43] nir: Handle fp16 rounding modes at nir_type_conversion_op

2017-11-02 Thread Chema Casanova


On 30/10/17 at 22:26, Jason Ekstrand wrote:
> On Thu, Oct 12, 2017 at 11:37 AM, Jose Maria Casanova Crespo
> > wrote:
>
> nir_type_conversion enables new operations to handle rounding modes to
> convert to fp16 values. Two new opcodes are enabled nir_op_f2f16_rtne
> and nir_op_f2f16_rtz.
>
> The undefined behaviour doesn't has any effect and uses the original
> nir_op_f2f16 operation.
>
> v2: Indentation fixed (Jason Ekstrand)
> ---
>  src/compiler/glsl/glsl_to_nir.cpp |  3 ++-
>  src/compiler/nir/nir.h            |  3 ++-
>  src/compiler/nir/nir_opcodes.py   | 10 --
>  src/compiler/nir/nir_opcodes_c.py | 15 ++-
>  src/compiler/spirv/vtn_alu.c      |  2 +-
>  5 files changed, 27 insertions(+), 6 deletions(-)
>
> diff --git a/src/compiler/glsl/glsl_to_nir.cpp
> b/src/compiler/glsl/glsl_to_nir.cpp
> index 9f25e30678..5738979b19 100644
> --- a/src/compiler/glsl/glsl_to_nir.cpp
> +++ b/src/compiler/glsl/glsl_to_nir.cpp
> @@ -1575,7 +1575,8 @@ nir_visitor::visit(ir_expression *ir)
>     case ir_unop_u642i64: {
>        nir_alu_type src_type =
> nir_get_nir_type_for_glsl_base_type(types[0]);
>        nir_alu_type dst_type =
> nir_get_nir_type_for_glsl_base_type(out_type);
> -      result = nir_build_alu(, nir_type_conversion_op(src_type,
> dst_type),
> +      result = nir_build_alu(, nir_type_conversion_op(src_type,
> dst_type,
> +                                 nir_rounding_mode_undef),
>                                   srcs[0], NULL, NULL, NULL);
>        /* b2i and b2f don't have fixed bit-size versions so the
> builder will
>         * just assume 32 and we have to fix it up here.
> diff --git a/src/compiler/nir/nir.h b/src/compiler/nir/nir.h
> index fb269fcb28..93f0d52804 100644
> --- a/src/compiler/nir/nir.h
> +++ b/src/compiler/nir/nir.h
> @@ -753,7 +753,8 @@ nir_get_nir_type_for_glsl_type(const struct
> glsl_type *type)
>     return
> nir_get_nir_type_for_glsl_base_type(glsl_get_base_type(type));
>  }
>
> -nir_op nir_type_conversion_op(nir_alu_type src, nir_alu_type dst);
> +nir_op nir_type_conversion_op(nir_alu_type src, nir_alu_type dst,
> +                              nir_rounding_mode rnd);
>
>  typedef enum {
>     NIR_OP_IS_COMMUTATIVE = (1 << 0),
> diff --git a/src/compiler/nir/nir_opcodes.py
> b/src/compiler/nir/nir_opcodes.py
> index 06ae820c3e..0abc34f037 100644
> --- a/src/compiler/nir/nir_opcodes.py
> +++ b/src/compiler/nir/nir_opcodes.py
> @@ -179,8 +179,14 @@ for src_t in [tint, tuint, tfloat]:
>        else:
>           bit_sizes = [8, 16, 32, 64]
>        for bit_size in bit_sizes:
> -         unop_convert("{0}2{1}{2}".format(src_t[0], dst_t[0],
> bit_size),
> -                      dst_t + str(bit_size), src_t, "src0")
> +          if bit_size == 16 and dst_t == tfloat and src_t == tfloat:
> +              rnd_modes = ['rtne', 'rtz']
> +              for rnd_mode in rnd_modes:
> +                  unop_convert("{0}2{1}{2}_{3}".format(src_t[0],
> dst_t[0],
> +                                                       bit_size,
> rnd_mode),
> +                               dst_t + str(bit_size), src_t, "src0")
> +          unop_convert("{0}2{1}{2}".format(src_t[0], dst_t[0],
> bit_size),
> +                       dst_t + str(bit_size), src_t, "src0")
>
>  # We'll hand-code the to/from bool conversion opcodes.  Because
> bool doesn't
>  # have multiple bit-sizes, we can always infer the size from the
> other type.
> diff --git a/src/compiler/nir/nir_opcodes_c.py
> b/src/compiler/nir/nir_opcodes_c.py
> index 02bb4738ed..95a76ea39f 100644
> --- a/src/compiler/nir/nir_opcodes_c.py
> +++ b/src/compiler/nir/nir_opcodes_c.py
> @@ -30,7 +30,7 @@ template = Template("""
>  #include "nir.h"
>
>  nir_op
> -nir_type_conversion_op(nir_alu_type src, nir_alu_type dst)
> +nir_type_conversion_op(nir_alu_type src, nir_alu_type dst,
> nir_rounding_mode rnd)
>  {
>     nir_alu_type src_base = (nir_alu_type)
> nir_alu_type_get_base_type(src);
>     nir_alu_type dst_base = (nir_alu_type)
> nir_alu_type_get_base_type(dst);
> @@ -64,7 +64,20 @@ nir_type_conversion_op(nir_alu_type src,
> nir_alu_type dst)
>                 switch (dst_bit_size) {
>  %                 for dst_bits in [16, 32, 64]:
>                    case ${dst_bits}:
> +%                    if src_t == 'float' and dst_t == 'float' and
> dst_bits == 16:
> +                     switch(rnd) {
> +%                       for rnd_t in ['rtne', 'rtz']:
> +                        case nir_rounding_mode_${rnd_t}:
> +                           return
> 

Re: [Mesa-dev] [PATCH v3 19/43] i965/fs: Support push constants of 16-bit types

2017-10-30 Thread Chema Casanova
On 30/10/17 at 07:44, Pohjolainen, Topi wrote:
> On Sun, Oct 29, 2017 at 11:17:11PM +0100, Chema Casanova wrote:
>> On 29/10/17 19:55, Pohjolainen, Topi wrote:
>>> On Thu, Oct 12, 2017 at 08:38:08PM +0200, Jose Maria Casanova Crespo wrote:
>>>> We enable the use of 16-bit values in push constants
>>>> modifying the assign_constant_locations function to work
>>>> with 16-bit types.
>>>>
>>>> The API to access buffers in Vulkan use multiples of 4-byte for
>>>> offsets and sizes. Current accountability of uniforms based on 4-byte
>>>> slots will work for 16-bit values if they are allowed to use 32-bit
>>>> slots. For that, we replace the division by 4 by a DIV_ROUND_UP, so
>>>> 2-byte elements will use 1 slot instead of 0.
>>>>
>>>> We aligns the 16-bit locations after assigning the 32-bit
>>>> ones.
>>>> ---
>>>>  src/intel/compiler/brw_fs.cpp | 30 +++---
>>>>  1 file changed, 23 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
>>>> index a1d49a63be..8da16145dc 100644
>>>> --- a/src/intel/compiler/brw_fs.cpp
>>>> +++ b/src/intel/compiler/brw_fs.cpp
>>>> @@ -1909,8 +1909,9 @@ set_push_pull_constant_loc(unsigned uniform, int 
>>>> *chunk_start,
>>>> if (!contiguous) {
>>>>/* If bitsize doesn't match the target one, skip it */
>>>>if (*max_chunk_bitsize != target_bitsize) {
>>>> - /* FIXME: right now we only support 32 and 64-bit accesses */
>>>> - assert(*max_chunk_bitsize == 4 || *max_chunk_bitsize == 8);
>>>> + assert(*max_chunk_bitsize == 4 ||
>>>> +*max_chunk_bitsize == 8 ||
>>>> +*max_chunk_bitsize == 2);
>>>>   *max_chunk_bitsize = 0;
>>>>   *chunk_start = -1;
>>>>   return;
>>>> @@ -1987,8 +1988,9 @@ fs_visitor::assign_constant_locations()
>>>>   int constant_nr = inst->src[i].nr + inst->src[i].offset / 4;
>>> Did you test this with, for example, vec4?
>> CTS has 16bit scalar, vec2 (uint,sint), vec4 (float) and matrix tests
>> for push constants for compute and graphics pipelines. For vec4 you can try:
>>
>> dEQP-VK.spirv_assembly.instruction.compute.16bit_storage.push_constant_16_to_32.vector_float
>>
>> For push constant tests in general there are 42 tests, but vec3 aren't
>> tested:
>>
>> dEQP-VK.*16bit_storage.*push_constant.
>>
>>
>>> I've been toying with a glsl
>>> lowering pass changing mediump floats into float16. I was curious to know 
>>> how
>>> much is needed as you have addressed most of the things from NIR onwards.
>>> Here I'm seeing offsets 0,2,4,6 which result into 0,0,1,1 when divided by
>>> four. Don't we need something of this sort in addition?
>> If i remember correctly, tests were testing to use push constants with
>> 64 16bit values, to use the minimum spec maximum available as
>> max_push_constants_size that is 128 bytes. So at the end the generated
>> intrinsic was:
>>
>> vec4 16 ssa_4 = intrinsic load_uniform (ssa_3) () (0, 128) /* base=0 */
>> /* range=128 */
>>
>> As the calculus here is to calculate the number of location used, and
>> taking into account that the Vulkan API restrictions for push constants
>> that says that push constant ranges that say that offset must be
>> multiple of 4 and size must be multiple of 4, maintain the use of
>> 4-bytes slots was ok for supporting the feature. Our code changes just
>> take the accountability in the number of 32-bits location needed, mainly
>> changing the divisions by 4 using DIV_ROUND_UP( , 4) to calculate sizes.
> I'm probably misunderstanding something. Let me ask a few clarifying 
> questions.
>
> I'm reading that the incoming 16-bit values are given in 32-bit slots, and for
> the same reason we place them in the push/pull buffers in 32-bits slots. In
> other words a vec4 would take 16-bytes and each component would 32-bits apart?

I probably explained it badly. An f16vec4 uses 8 bytes, and its
components are 16 bits apart. The requirement that the offset be a
multiple of 32 bits only applies to the first element.
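
To make the layout concrete, here is a small sketch of the byte offsets
described above. The helper names are made up for illustration and are not
part of the patch.

```cpp
#include <cassert>
#include <cstddef>

constexpr size_t kHalfBytes = 2;   // size of one 16-bit component

// Byte offset of component `i` of a 16-bit vector starting at `base_offset`.
// Only the base has to be a multiple of 4; components are packed tightly.
static size_t
f16_component_offset(size_t base_offset, unsigned i)
{
   assert(base_offset % 4 == 0);
   return base_offset + i * kHalfBytes;
}

// Total size in bytes of a 16-bit vector with `n` components.
static size_t
f16_vector_size(unsigned n)
{
   return n * kHalfBytes;
}
```

For an f16vec4 at offset 0 this gives component offsets 0, 2, 4 and 6 and
a total size of 8 bytes, matching the offsets Topi observed.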

> If that is the case, then don't we need to adjust the register offsets
> somewhere the way I did in the fragment below? Otherwise the offsets will
> point to locations in the register that are simply 16-bits apart?

Yes compone

Re: [Mesa-dev] [PATCH v3 19/43] i965/fs: Support push constants of 16-bit types

2017-10-29 Thread Chema Casanova
On 29/10/17 19:55, Pohjolainen, Topi wrote:
> On Thu, Oct 12, 2017 at 08:38:08PM +0200, Jose Maria Casanova Crespo wrote:
>> We enable the use of 16-bit values in push constants
>> modifying the assign_constant_locations function to work
>> with 16-bit types.
>>
>> The API to access buffers in Vulkan use multiples of 4-byte for
>> offsets and sizes. Current accountability of uniforms based on 4-byte
>> slots will work for 16-bit values if they are allowed to use 32-bit
>> slots. For that, we replace the division by 4 by a DIV_ROUND_UP, so
>> 2-byte elements will use 1 slot instead of 0.
>>
>> We aligns the 16-bit locations after assigning the 32-bit
>> ones.
>> ---
>>  src/intel/compiler/brw_fs.cpp | 30 +++---
>>  1 file changed, 23 insertions(+), 7 deletions(-)
>>
>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
>> index a1d49a63be..8da16145dc 100644
>> --- a/src/intel/compiler/brw_fs.cpp
>> +++ b/src/intel/compiler/brw_fs.cpp
>> @@ -1909,8 +1909,9 @@ set_push_pull_constant_loc(unsigned uniform, int 
>> *chunk_start,
>> if (!contiguous) {
>>/* If bitsize doesn't match the target one, skip it */
>>if (*max_chunk_bitsize != target_bitsize) {
>> - /* FIXME: right now we only support 32 and 64-bit accesses */
>> - assert(*max_chunk_bitsize == 4 || *max_chunk_bitsize == 8);
>> + assert(*max_chunk_bitsize == 4 ||
>> +*max_chunk_bitsize == 8 ||
>> +*max_chunk_bitsize == 2);
>>   *max_chunk_bitsize = 0;
>>   *chunk_start = -1;
>>   return;
>> @@ -1987,8 +1988,9 @@ fs_visitor::assign_constant_locations()
>>   int constant_nr = inst->src[i].nr + inst->src[i].offset / 4;
> 
> Did you test this with, for example, vec4?

CTS has 16bit scalar, vec2 (uint,sint), vec4 (float) and matrix tests
for push constants for compute and graphics pipelines. For vec4 you can try:

dEQP-VK.spirv_assembly.instruction.compute.16bit_storage.push_constant_16_to_32.vector_float

There are 42 push constant tests in total, but vec3 isn't
tested:

dEQP-VK.*16bit_storage.*push_constant.


> I've been toying with a glsl
> lowering pass changing mediump floats into float16. I was curious to know how
> much is needed as you have addressed most of the things from NIR onwards.
> Here I'm seeing offsets 0,2,4,6 which result into 0,0,1,1 when divided by
> four. Don't we need something of this sort in addition?

If I remember correctly, the tests used push constants with 64 16-bit
values, to fill the minimum max_push_constants_size required by the
spec, which is 128 bytes. So the generated intrinsic ended up as:

vec4 16 ssa_4 = intrinsic load_uniform (ssa_3) () (0, 128) /* base=0 */
/* range=128 */

The calculation here only counts the number of locations used. Since the
Vulkan API requires push constant ranges to have an offset and a size
that are multiples of 4, keeping the 4-byte slots was enough to support
the feature. Our changes just account for the number of 32-bit locations
needed, mainly by replacing the divisions by 4 with DIV_ROUND_UP( , 4)
when calculating sizes.
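
As a quick illustration of why the rounding matters, here is a minimal
sketch of the slot accounting. The function names are illustrative;
div_round_up mirrors the idea of the DIV_ROUND_UP macro and is not the
Mesa definition.

```cpp
#include <cassert>

// Integer division that rounds the quotient up instead of truncating.
static unsigned
div_round_up(unsigned n, unsigned d)
{
   assert(d > 0);
   return (n + d - 1) / d;
}

// Number of 4-byte uniform slots needed to hold `size_bytes` bytes.
static unsigned
slots_for_bytes(unsigned size_bytes)
{
   return div_round_up(size_bytes, 4);
}
```

A single 2-byte value gets 1 slot with this accounting, whereas the plain
division 2 / 4 would give 0. The 128-byte range from the test still maps
to 32 slots.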

> commit 1a6d2bf3302f6e4305e383da0f27712dc5c20a67
> Author: Topi Pohjolainen 
> Date:   Sun Oct 29 20:28:03 2017 +0200
> 
> fix alignment of 16-bit uniforms on 32-bit slots
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp 
> b/src/intel/compiler/brw_fs_nir.cpp
> index 2f5443958a..586eb9d9ff 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -4007,7 +4007,10 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , 
> nir_intrinsic_instr *instr
>   src.offset = const_offset->u32[0];
>  
>   for (unsigned j = 0; j < instr->num_components; j++) {
> -bld.MOV(offset(dest, bld, j), offset(src, bld, j));
> +const unsigned src_offset =
> +  src.type == BRW_REGISTER_TYPE_HF ? 2 * j : j;
> +
> +bld.MOV(offset(dest, bld, j), offset(src, bld, src_offset));
> 
> 
> 
> Then about the change of using 32-bit slots. This is now unconditional and
> would require revisiting if we wanted to pack 16-bits tighter and possibly
> increase the amount of uniforms that can be pushed. Similarly to Vulkan, in
> GL the core stores uniforms as floats and I think we should keep it that way.

> I added support in the i965 backend to keep track of the types of the
> uniforms and to convert 32-bit presentation to 16-bits on the fly in
> gen6_constant_state.c::brw_param_value(). I don't like it that much but I had
> to start from somewhere.

> My thinking is that we'd want to decouple the storage of the values and the
> packing used in the compiler backend. Ideally keeping the mesa gl core and the
> api working with full 32-bit floats but using tight 16-bit slots in the
> push/pull 

Re: [Mesa-dev] [PATCH v3 38/43] i965/fs: Optimize 16-bit SSBO stores by packing two into a 32-bit reg

2017-10-24 Thread Chema Casanova
On 22/10/17 at 12:31, Eduardo Lima Mitev wrote:
> On 10/12/2017 08:38 PM, Jose Maria Casanova Crespo wrote:
>> From: Eduardo Lima Mitev <el...@igalia.com>
>>
>> Currently, we use byte-scattered write messages for storing 16-bit
>> into an SSBO. This is because untyped surface messages have a fixed
>> 32-bit size.
>>
>> This patch optimizes these 16-bit writes by combining 2 values (e.g,
>> two consecutive components) into a 32-bit register, packing the two
>> 16-bit words.
>>
>> 16-bit single component values will continue to use byte-scattered
>> write messages.
>>
>> This optimization reduces the number of SEND messages used for storing
>> 16-bit values potentially by 2 or 4, which cuts down execution time
>> significantly because byte-scattered writes are an expensive
>> operation.
>>
>> v2: Removed use of stride = 2 on sources (Jason Ekstrand)
>> Rework optimization using shuffle 16 write and enable writes
>> of 16bit vec4 with only one message of 32-bits. (Chema Casanova)
>>
>> Signed-off-by: Jose Maria Casanova Crespo <jmcasan...@igalia.com>
>> Signed-off-by: Eduardo Lima <el...@igalia.com>
>> ---
>>  src/intel/compiler/brw_fs_nir.cpp | 64 
>> +++
>>  1 file changed, 52 insertions(+), 12 deletions(-)
>>
>> diff --git a/src/intel/compiler/brw_fs_nir.cpp 
>> b/src/intel/compiler/brw_fs_nir.cpp
>> index 2d0b3e139e..c07b3e4d8d 100644
>> --- a/src/intel/compiler/brw_fs_nir.cpp
>> +++ b/src/intel/compiler/brw_fs_nir.cpp
>> @@ -4218,6 +4218,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , 
>> nir_intrinsic_instr *instr
>>  instr->num_components);
>>   val_reg = tmp;
>>}
>> +  if (bit_size == 16) {
>> + val_reg=retype(val_reg, BRW_REGISTER_TYPE_HF);
> Spaces around the '='. Probably also remove the block since there  is
> only one instruction in it.

Fixed locally.

>
>> +  }
>>  
>>/* 16-bit types would use a minimum of 1 slot */
>>unsigned type_slots = MAX2(type_size / 4, 1);
>> @@ -4231,6 +4234,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , 
>> nir_intrinsic_instr *instr
>>   unsigned first_component = ffs(writemask) - 1;
>>   unsigned length = ffs(~(writemask >> first_component)) - 1;
>>  
>> + fs_reg current_val_reg =
>> +offset(val_reg, bld, first_component * type_slots);
>> +
>>   /* We can't write more than 2 64-bit components at once. Limit the
>>* length of the write to what we can do and let the next iteration
>>* handle the rest
>> @@ -4238,11 +4244,40 @@ fs_visitor::nir_emit_intrinsic(const fs_builder 
>> , nir_intrinsic_instr *instr
>>   if (type_size > 4) {
>>  length = MIN2(2, length);
>>   } else if (type_size == 2) {
>> -/* For 16-bit types we are using byte scattered writes, that can
>> - * only write one component per call. So we limit the length, 
>> and
>> - * let the write happening in several iterations.
>> +/* For 16-bit types we pack two consecutive values into a
>> + * 32-bit word and use an untyped write message. For single 
>> values
>> + * we need to use byte-scattered writes because untyped writes 
>> work
>> + * on multiples of 32 bits.
>> + *
>> + * For example, if there is a 3-component vector we submit one
>> + * untyped-write message of 32-bit (first two components), and 
>> one
>> + * byte-scattered write message (the last component).
>>   */
>> -length = 1;
>> +if (length >= 2) {
>> +   /* pack two consecutive 16-bit words into a 32-bit register,
>> +* using the same original source register.
>> +*/
>> +   length -= length % 2;
>> +   fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F, length / 2);
>> +   shuffle_16bit_data_for_32bit_write(bld,
>> +  tmp,
>> +  current_val_reg,
>> +  length);
>> +   current_val_reg = tmp;
>> +
>> +} else {
>> +   /* For single 16-bit values, we just limit the length to 1 
>> and
>> +   

Re: [Mesa-dev] [PATCH v3 27/43] anv/pipeline: Use 32-bit surface formats for 16-bit formats

2017-10-24 Thread Chema Casanova
On 16/10/17 at 08:57, Alejandro Piñeiro wrote:
> On 15/10/17 12:14, Pohjolainen, Topi wrote:
>> On Thu, Oct 12, 2017 at 08:38:16PM +0200, Jose Maria Casanova Crespo wrote:
>>> From: Alejandro Piñeiro 
>>>
>>> From Vulkan 1.0.50 spec, Section 3.30.1. Format Definition:
>>> VK_FORMAT_R16G16_SFLOAT
>>>
>>> A two-component, 32-bit signed floating-point format that has a
>>> 16-bit R component in bytes 0..1, and a 16-bit G component in
>>> bytes 2..3.
>>>
>>> So this format expects those 16-bit floats to be passed without any
>>> conversion (applies too using 2/3/4 components, and with int formats)
>>>
>>> But from skl PRM, vol 07, section FormatConversion, page 445 there is
>>> a table that points that *16*FLOAT formats are converted to FLOAT,
>>> that in that context, is a 32-bit float. This is similar to the
>>> *64*FLOAT formats, that converts 64-bit floats to 32-bit floats.
>>>
>>> Unfortunately, while with 64-bit floats we have the alternative to use
>>> *64*PASSTHRU formats, it is not the case with 16-bits.
>>>
>>> This issue happens too with 16-bit int surface formats.
>>>
>>> As a workaround, if we are using a 16-bit location at the shader, we
>>> use 32-bit formats to avoid the conversion, and will fix getting the
>>> proper content later. Note that as we are using 32-bit formats, we can
>>> use formats with less components (example: use *R32* for *R16G16*).
>>>
>>> Signed-off-by: Jose Maria Casanova Crespo 
>>> Signed-off-by: Alejandro Piñeiro 
>>> ---
>>>  src/intel/vulkan/genX_pipeline.c | 47 
>>> 
>>>  1 file changed, 47 insertions(+)
>>>
>>> diff --git a/src/intel/vulkan/genX_pipeline.c 
>>> b/src/intel/vulkan/genX_pipeline.c
>>> index c2fa9c0ff7..8b2d472787 100644
>>> --- a/src/intel/vulkan/genX_pipeline.c
>>> +++ b/src/intel/vulkan/genX_pipeline.c
>>> @@ -83,6 +83,44 @@ vertex_element_comp_control(enum isl_format format, 
>>> unsigned comp)
>>> }
>>>  }
>>>  
>>> +#if GEN_GEN >= 8
>>> +static enum isl_format
>>> +adjust_16bit_format(enum isl_format format)
>>> +{
>>> +   switch(format) {
>>> +   case ISL_FORMAT_R16_FLOAT:
>>> +  return ISL_FORMAT_R32_FLOAT;
>>> +   case ISL_FORMAT_R16G16_FLOAT:
>>> +  return ISL_FORMAT_R32_FLOAT;
>>> +   case ISL_FORMAT_R16G16B16_FLOAT:
>>> +  return ISL_FORMAT_R32G32_FLOAT;
>>> +   case ISL_FORMAT_R16G16B16A16_FLOAT:
>>> +  return ISL_FORMAT_R32G32_FLOAT;
>>> +
>>> +   case ISL_FORMAT_R16_SINT:
>>> +  return ISL_FORMAT_R32_SINT;
>>> +   case ISL_FORMAT_R16G16_SINT:
>>> +  return ISL_FORMAT_R32_SINT;
>>> +   case ISL_FORMAT_R16G16B16_SINT:
>>> +  return ISL_FORMAT_R32G32_SINT;
>>> +   case ISL_FORMAT_R16G16B16A16_SINT:
>>> +  return ISL_FORMAT_R32G32_SINT;
>>> +
>>> +   case ISL_FORMAT_R16_UINT:
>>> +  return ISL_FORMAT_R32_UINT;
>>> +   case ISL_FORMAT_R16G16_UINT:
>>> +  return ISL_FORMAT_R32_UINT;
>>> +   case ISL_FORMAT_R16G16B16_UINT:
>>> +  return ISL_FORMAT_R32G32_UINT;
>>> +   case ISL_FORMAT_R16G16B16A16_UINT:
>>> +  return ISL_FORMAT_R32G32_UINT;
>>> +
>>> +   default:
>>> +  return format;
>>> +   }
>>> +}
>> Just wondering aloud. As we are going to reinterpret the data in any case 
>> we could simply use _UINT variants even for FLOAT and SINT. It doesn't really
>> make any difference only that here someone might think it is somehow relevant
>> to keep the base type. 
> FWIW, I also don't mind too much. So ok, we could try that change too.
I've just sent a v2 of this patch. The code is simplified and doesn't
introduce regressions on 16-bit tests.
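
For readers without the v2 at hand: taken literally, Topi's suggestion
would reduce the switch to something like the sketch below. This is not
the code from the v2 patch, only an illustration, and the enum is a
stand-in for the few isl_format values discussed in this thread.

```cpp
// Stand-in for the formats named in the patch; OTHER covers everything else.
enum class Fmt {
   R16_FLOAT, R16G16_FLOAT, R16G16B16_FLOAT, R16G16B16A16_FLOAT,
   R16_SINT, R16G16_SINT, R16G16B16_SINT, R16G16B16A16_SINT,
   R16_UINT, R16G16_UINT, R16G16B16_UINT, R16G16B16A16_UINT,
   R32_UINT, R32G32_UINT,
   OTHER,
};

// Map every 16-bit format to a 32-bit UINT format with half the components
// (rounded up), since the data is reinterpreted in the shader anyway.
static Fmt
adjust_16bit_format_uint_only(Fmt format)
{
   switch (format) {
   case Fmt::R16_FLOAT: case Fmt::R16_SINT: case Fmt::R16_UINT:
   case Fmt::R16G16_FLOAT: case Fmt::R16G16_SINT: case Fmt::R16G16_UINT:
      return Fmt::R32_UINT;
   case Fmt::R16G16B16_FLOAT: case Fmt::R16G16B16_SINT:
   case Fmt::R16G16B16_UINT:
   case Fmt::R16G16B16A16_FLOAT: case Fmt::R16G16B16A16_SINT:
   case Fmt::R16G16B16A16_UINT:
      return Fmt::R32G32_UINT;
   default:
      return format;   // formats that are not 16-bit are left untouched
   }
}
```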

Chema


Re: [Mesa-dev] [PATCH v3 29/43] i965/fs: Unpack 16-bit from 32-bit components in VS load_input

2017-10-24 Thread Chema Casanova
On 15/10/17 at 12:59, Pohjolainen, Topi wrote:
> On Thu, Oct 12, 2017 at 08:38:18PM +0200, Jose Maria Casanova Crespo wrote:
>> The VS load input for 16-bit values receives pairs of 16-bit values
>> packed in 32-bit values. Because of the adjusted format used at:
>>
>>  anv/pipeline: Use 32-bit surface formats for 16-bit formats
>>
>> v2: Removed use of stride = 2 on 16-bit sources (Jason Ekstrand)
>> ---
>>  src/intel/compiler/brw_fs_nir.cpp | 21 +++--
>>  1 file changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/src/intel/compiler/brw_fs_nir.cpp 
>> b/src/intel/compiler/brw_fs_nir.cpp
>> index d2f2e17b70..83ff0607a7 100644
>> --- a/src/intel/compiler/brw_fs_nir.cpp
>> +++ b/src/intel/compiler/brw_fs_nir.cpp
>> @@ -2322,8 +2322,25 @@ fs_visitor::nir_emit_vs_intrinsic(const fs_builder 
>> ,
>>assert(const_offset && "Indirect input loads not allowed");
>>src = offset(src, bld, const_offset->u32[0]);
>>  
>> -  for (unsigned j = 0; j < num_components; j++) {
>> - bld.MOV(offset(dest, bld, j), offset(src, bld, j + 
>> first_component));
>> +  /* The VS load input for 16-bit values receives pairs of 16-bit values
>> +   * packed in 32-bit values. This is an example on SIMD8:
>> +   *
>> +   * xy xy xy xy xy xy xy xy
>> +   * zw zw zw zw yw zw zw xw
> zw
Fixed locally.
>
>> +   *
>> +   * We need to format it to something like:
>> +   *
>> +   * xx xx xx xx yy yy yy yy
>> +   * zz zz zz zz ww ww ww ww
>> +   */
>> +  if (type_sz(type) == 2) {
>> + for (unsigned j = 0; j < num_components; j++)
>> +bld.MOV(offset(dest, bld, j),
>> +subscript(retype(offset(src,bld, (j / 2) * 2 + 
>> first_component),
> Space missing before 'bld'. I went thru the math and it looks correct to me.
Fixed locally. The math follows the same principles as the
shuffle_32bit_load_result_to_16bit_data function introduced in the next
patch of the series, but it is simpler because here the source and
destination are different registers, which we cannot assume in the
general case.
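
The index math can be checked with a small CPU-side simulation. This is a
sketch under assumptions: SIMD8, first_component == 0, offset() advancing
by whole components of the 16-bit source type, and subscript() selecting
one 16-bit half of each 32-bit element. The function names are
illustrative only.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr unsigned kSimdWidth = 8;

// Build a payload where 32-bit component c of each lane holds the 16-bit
// pair (2c, 2c + 1). Component j of lane l is encoded as j * 100 + l.
static std::vector<uint16_t>
make_example_payload(unsigned num_components)
{
   const unsigned dwords = (num_components + 1) / 2;
   std::vector<uint16_t> payload(dwords * kSimdWidth * 2);
   for (unsigned c = 0; c < dwords; c++)
      for (unsigned lane = 0; lane < kSimdWidth; lane++)
         for (unsigned half = 0; half < 2; half++)
            payload[(c * kSimdWidth + lane) * 2 + half] =
               static_cast<uint16_t>((2 * c + half) * 100 + lane);
   return payload;
}

// Unpack to one tightly packed 16-bit component per SIMD8 group.
static std::vector<uint16_t>
unpack_16bit_inputs(const std::vector<uint16_t> &payload,
                    unsigned num_components)
{
   assert(payload.size() >= ((num_components + 1) / 2) * 2 * kSimdWidth);
   std::vector<uint16_t> dest(num_components * kSimdWidth);
   for (unsigned j = 0; j < num_components; j++) {
      // (j / 2) * 2 16-bit components == j / 2 whole 32-bit components.
      const unsigned base = (j / 2) * 2 * kSimdWidth;
      for (unsigned lane = 0; lane < kSimdWidth; lane++) {
         // j % 2 picks the low or high half of each 32-bit element.
         dest[j * kSimdWidth + lane] = payload[base + 2 * lane + (j % 2)];
      }
   }
   return dest;
}
```

Because the result is written to a separate destination, no temporary is
needed, which is the simplification mentioned above.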
>
>> + BRW_REGISTER_TYPE_F), type, j % 2));
>> +  } else {
>> + for (unsigned j = 0; j < num_components; j++)
>> +bld.MOV(offset(dest, bld, j), offset(src, bld, j + 
>> first_component));
>>}
>>  
>>if (type == BRW_REGISTER_TYPE_DF) {
>> -- 
>> 2.13.6
>>
>> ___
>



Re: [Mesa-dev] [PATCH v3 05/43] nir: Populate conversion opcodes to/from 16-bit types

2017-10-24 Thread Chema Casanova
On 21/10/17 at 11:44, Pohjolainen, Topi wrote:
> On Sat, Oct 21, 2017 at 11:22:45AM +0300, Pohjolainen, Topi wrote:
>> On Thu, Oct 12, 2017 at 08:37:54PM +0200, Jose Maria Casanova Crespo wrote:
>>> From: Eduardo Lima Mitev 
>>>
>>> This will include the following NIR ALU opcodes:
>>>  * nir_op_i2i16
>>>  * nir_op_i2f16
>>>  * nir_op_u2u16
>>>  * nir_op_u2f16
>>>  * nir_op_f2i16
>>>  * nir_op_f2u16
>>>  * nir_op_f2f16
>> Subject says "...to/from 16-bit types", it should only say "to", right?
>>
>> I thought conversion from 16-bits to 32-bits was also needed but apparently
>> not (all the promotions seem to happen case by case in the backend. For 
>> example,
>> the move from 16-bits to 32-bits when 16-bit RT isn't supported). A few
>> words here in the commit would be nice explaining why only one direction is
>> needed.
> Right, I forgot, conversions in the other direction use simply f2* with source
> bit-size set accordingly.
>
> So, just drop "from" in the subject.
I've just fixed the subject locally. In the early stages of the
implementation we started handling the 16->32 bit conversions, but as
there are no alignment restrictions in this direction, they are handled
simply by giving the source of the MOV the appropriate bit size. In this
v3 series we removed some unnecessary alignments.

Thanks for devoting time to this review.

>
>>> Reviewed-by: Jason Ekstrand 
>>> ---
>>>  src/compiler/nir/nir_opcodes_c.py | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/src/compiler/nir/nir_opcodes_c.py 
>>> b/src/compiler/nir/nir_opcodes_c.py
>>> index a1db54f05a..02bb4738ed 100644
>>> --- a/src/compiler/nir/nir_opcodes_c.py
>>> +++ b/src/compiler/nir/nir_opcodes_c.py
>>> @@ -62,7 +62,7 @@ nir_type_conversion_op(nir_alu_type src, nir_alu_type dst)
>>>  % endif
>>>  %  endif
>>> switch (dst_bit_size) {
>>> -% for dst_bits in [32, 64]:
>>> +% for dst_bits in [16, 32, 64]:
>>>case ${dst_bits}:
>>>   return ${'nir_op_{0}2{1}{2}'.format(src_t[0], 
>>> dst_t[0], dst_bits)};
>>>  % endfor
>>> -- 
>>> 2.13.6
>>>



Re: [Mesa-dev] [PATCH v3 25/43] compiler: Mark when input/ouput attribute at VS uses 16-bit

2017-10-17 Thread Chema Casanova
On 15/10/17 12:00, Pohjolainen, Topi wrote:
> On Thu, Oct 12, 2017 at 08:38:14PM +0200, Jose Maria Casanova Crespo wrote:
>> New shader attribute to mark when a location has 16-bit
>> value. This patch includes support on mesa glsl and nir.
>> ---
>>  src/compiler/glsl_types.h  | 24 
>>  src/compiler/nir/nir_gather_info.c | 23 ---
>>  src/compiler/nir_types.cpp |  6 ++
>>  src/compiler/nir_types.h   |  1 +
>>  src/compiler/shader_info.h |  2 ++
>>  5 files changed, 49 insertions(+), 7 deletions(-)
>>
>> diff --git a/src/compiler/glsl_types.h b/src/compiler/glsl_types.h
>> index 32399df351..d05e612e66 100644
>> --- a/src/compiler/glsl_types.h
>> +++ b/src/compiler/glsl_types.h
>> @@ -93,6 +93,13 @@ static inline bool glsl_base_type_is_integer(enum 
>> glsl_base_type type)
>>type == GLSL_TYPE_IMAGE;
>>  }
>>  
>> +static inline bool glsl_base_type_is_16bit(enum glsl_base_type type)
>> +{
>> +   return type == GLSL_TYPE_FLOAT16 ||
>> +  type == GLSL_TYPE_UINT16 ||
>> +  type == GLSL_TYPE_INT16;
>> +}
>> +
>>  enum glsl_sampler_dim {
>> GLSL_SAMPLER_DIM_1D = 0,
>> GLSL_SAMPLER_DIM_2D,
>> @@ -546,6 +553,15 @@ struct glsl_type {
>>return is_64bit() && vector_elements > 2;
>> }
>>  
>> +
>> +   /**
>> +* Query whether a 16-bit type takes half slots.
>> +*/
>> +   bool is_half_slot() const
> 
> I haven't checked later patches but here at least I'm wondering why we need
> two functionally identical helpers with different names, i.e., is_half_slot()
> and is_16bit().

It is true that, at the moment, any use of is_half_slot could be
replaced directly with is_16bit.

So removing is_half_slot would make the code easier to understand. The
only idea behind having two names was to use the concept of half slots
when tracking 16-bit input attribute locations at the VS, in the same
way as dual slots are tracked for 64-bit types (64 bits && (vec3 ||
vec4)).

After thinking about it, it would also be clearer to keep is_16bit as a
helper for future uses. But for the particular case of checking half
slots we could just use:

(glsl_get_bit_size(glsl_without_array(var->type)) == 16)


In this case what really matters is that we have 16-bit values and so
need to unshuffle them, regardless of the fact that 16-bit values use
half of a slot.

>> +   {
>> +  return is_16bit();
>> +   }
>> +
>> /**
>>  * Query whether or not a type is 64-bit
>>  */
>> @@ -555,6 +571,14 @@ struct glsl_type {
>> }
>>  
>> /**
>> +* Query whether or not a type is 16-bit
>> +*/
>> +   bool is_16bit() const
>> +   {
>> +  return glsl_base_type_is_16bit(base_type);
>> +   }
>> +
>> +   /**
>>  * Query whether or not a type is a non-array boolean type
>>  */
>> bool is_boolean() const
>> diff --git a/src/compiler/nir/nir_gather_info.c 
>> b/src/compiler/nir/nir_gather_info.c
>> index ac87bec46c..c7f8ff29cb 100644
>> --- a/src/compiler/nir/nir_gather_info.c
>> +++ b/src/compiler/nir/nir_gather_info.c
>> @@ -212,14 +212,22 @@ gather_intrinsic_info(nir_intrinsic_instr *instr, 
>> nir_shader *shader)
>>   if (!try_mask_partial_io(shader, instr->variables[0]))
>>  mark_whole_variable(shader, var);
>>  
>> - /* We need to track which input_reads bits correspond to a
>> -  * dvec3/dvec4 input attribute */
>> + /* We need to track which input_reads bits correspond to
>> +  * dvec3/dvec4 or 16-bit  input attributes */
>>   if (shader->stage == MESA_SHADER_VERTEX &&
>> - var->data.mode == nir_var_shader_in &&
>> - glsl_type_is_dual_slot(glsl_without_array(var->type))) {
>> -for (uint i = 0; i < glsl_count_attribute_slots(var->type, 
>> false); i++) {
>> -   int idx = var->data.location + i;
>> -   shader->info.double_inputs_read |= BITFIELD64_BIT(idx);
>> + var->data.mode == nir_var_shader_in) {
>> +if (glsl_type_is_dual_slot(glsl_without_array(var->type))) {
>> +   for (uint i = 0; i < glsl_count_attribute_slots(var->type, 
>> false); i++) {
>> +  int idx = var->data.location + i;
>> +  shader->info.double_inputs_read |= BITFIELD64_BIT(idx);
>> +   }
>> +} else {
>> +   if (glsl_type_is_half_slot(glsl_without_array(var->type))) {
> 
> This could be:
> 
>} else if 
> (glsl_type_is_half_slot(glsl_without_array(var->type))) {
> 
> allowing us to reduce indentation in the block.

Agreed. Combining this with the change proposed above, it becomes:

} else if (glsl_get_bit_size(glsl_without_array(var->type)) == 16) {

I'm sending a v2 of this patch with these changes.
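
For illustration only, the flattened if / else-if shape could look like
the standalone sketch below. The struct and the half_inputs_read field
name are assumptions made up for this example (the 16-bit branch of the
quoted hunk is cut off, so the real field name is not visible here), and
locations are assumed to stay below 64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping; only double_inputs_read appears in the patch. */
struct toy_input_masks {
   uint64_t double_inputs_read;
   uint64_t half_inputs_read; /* assumed name for the 16-bit mask */
};

/* Mark every attribute slot used by one vertex shader input. */
static void
toy_mark_input(struct toy_input_masks *m, unsigned location, unsigned slots,
               bool is_dual_slot, unsigned bit_size)
{
   for (unsigned i = 0; i < slots; i++) {
      const uint64_t bit = UINT64_C(1) << (location + i);

      if (is_dual_slot)
         m->double_inputs_read |= bit;   /* dvec3/dvec4 inputs */
      else if (bit_size == 16)
         m->half_inputs_read |= bit;     /* 16-bit inputs */
   }
}
```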

>> +  for (uint i = 0; i < 
>> glsl_count_attribute_slots(var->type, false); i++) {
>> + int idx = var->data.location + 

Re: [Mesa-dev] [PATCH v3 23/43] i965/fs: Add byte scattered read message and fs support

2017-10-17 Thread Chema Casanova


On 15/10/17 11:47, Pohjolainen, Topi wrote:
> On Thu, Oct 12, 2017 at 08:38:12PM +0200, Jose Maria Casanova Crespo wrote:
>> ---
>>  src/intel/compiler/brw_eu.h|  7 +
>>  src/intel/compiler/brw_eu_defines.h|  2 ++
>>  src/intel/compiler/brw_eu_emit.c   | 41 
>> ++
>>  src/intel/compiler/brw_fs.cpp  | 10 +++
>>  src/intel/compiler/brw_fs_copy_propagation.cpp |  2 ++
>>  src/intel/compiler/brw_fs_generator.cpp|  5 
>>  src/intel/compiler/brw_fs_surface_builder.cpp  | 12 
>>  src/intel/compiler/brw_fs_surface_builder.h|  5 
>>  src/intel/compiler/brw_shader.cpp  |  6 
>>  9 files changed, 90 insertions(+)
>>
>> diff --git a/src/intel/compiler/brw_eu.h b/src/intel/compiler/brw_eu.h
>> index b44ca0f518..ca1ff21a83 100644
>> --- a/src/intel/compiler/brw_eu.h
>> +++ b/src/intel/compiler/brw_eu.h
>> @@ -476,6 +476,13 @@ brw_typed_surface_write(struct brw_codegen *p,
>>  unsigned num_channels);
>>  
>>  void
>> +brw_byte_scattered_read(struct brw_codegen *p,
>> +struct brw_reg dst,
>> +struct brw_reg payload,
>> +struct brw_reg surface,
>> +unsigned msg_length);
>> +
>> +void
>>  brw_byte_scattered_write(struct brw_codegen *p,
>>   struct brw_reg payload,
>>   struct brw_reg surface,
>> diff --git a/src/intel/compiler/brw_eu_defines.h 
>> b/src/intel/compiler/brw_eu_defines.h
>> index 9aac385ba7..c5dc5fd5fb 100644
>> --- a/src/intel/compiler/brw_eu_defines.h
>> +++ b/src/intel/compiler/brw_eu_defines.h
>> @@ -397,6 +397,8 @@ enum opcode {
>>  * opcode, but instead of taking a single payload blog they expect their
>>  * arguments separately as individual sources, like untyped write/read.
>>  */
>> +   SHADER_OPCODE_BYTE_SCATTERED_READ,
>> +   SHADER_OPCODE_BYTE_SCATTERED_READ_LOGICAL,
>> SHADER_OPCODE_BYTE_SCATTERED_WRITE,
>> SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL,
>>  
>> diff --git a/src/intel/compiler/brw_eu_emit.c 
>> b/src/intel/compiler/brw_eu_emit.c
>> index 84d85be653..8c83d8b500 100644
>> --- a/src/intel/compiler/brw_eu_emit.c
>> +++ b/src/intel/compiler/brw_eu_emit.c
>> @@ -2929,6 +2929,47 @@ brw_untyped_surface_write(struct brw_codegen *p,
>>p, insn, num_channels);
>>  }
>>  
>> +
>> +
>> +static void
>> +brw_set_dp_byte_scattered_read_message(struct brw_codegen *p,
>> +   struct brw_inst *insn)
>> +{
>> +
>> +   const struct gen_device_info *devinfo = p->devinfo;
>> +   /* Set mask of 32-bit channels to drop. */
>> +   unsigned msg_control = GEN7_BYTE_SCATTERED_DATA_SIZE_WORD << 2;
>> +
>> +   if (brw_inst_access_mode(devinfo, p->current) == BRW_ALIGN_1) {
>> +  if (brw_inst_exec_size(devinfo, p->current) == BRW_EXECUTE_16)
>> + msg_control |= 1; /* SIMD16 mode */
>> +  else
>> + msg_control |= 2; /* SIMD8 mode */
>> +   }
>> +
>> +   brw_inst_set_dp_msg_type(devinfo, insn,
>> +(devinfo->gen >= 8 || devinfo->is_haswell ?
>> + HSW_DATAPORT_DC_PORT0_BYTE_SCATTERED_READ :
>> + GEN7_DATAPORT_DC_BYTE_SCATTERED_READ));
>> +   brw_inst_set_dp_msg_control(devinfo, insn, msg_control);
>> +}
>> +
>> +void
>> +brw_byte_scattered_read(struct brw_codegen *p,
>> +struct brw_reg dst,
>> +struct brw_reg payload,
>> +struct brw_reg surface,
>> +unsigned msg_length)
>> +{
>> +   const unsigned sfid =  GEN7_SFID_DATAPORT_DATA_CACHE;
>> +   struct brw_inst *insn = brw_send_indirect_scattered_message(
>> +  p, sfid, dst, payload, surface, msg_length,
>> +  brw_surface_payload_size(p, 1, true, true),
>> +  false);
>> +
>> +   brw_set_dp_byte_scattered_read_message(p, insn);
>> +}
>> +
>>  static void
>>  brw_set_dp_byte_scattered_write(struct brw_codegen *p,
>>  struct brw_inst *insn)
>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
>> index e4a94ff053..bd0d32b741 100644
>> --- a/src/intel/compiler/brw_fs.cpp
>> +++ b/src/intel/compiler/brw_fs.cpp
>> @@ -251,6 +251,7 @@ fs_inst::is_send_from_grf() const
>> case SHADER_OPCODE_UNTYPED_SURFACE_READ:
>> case SHADER_OPCODE_UNTYPED_SURFACE_WRITE:
>> case SHADER_OPCODE_BYTE_SCATTERED_WRITE:
>> +   case SHADER_OPCODE_BYTE_SCATTERED_READ:
>> case SHADER_OPCODE_TYPED_ATOMIC:
>> case SHADER_OPCODE_TYPED_SURFACE_READ:
>> case SHADER_OPCODE_TYPED_SURFACE_WRITE:
>> @@ -733,6 +734,7 @@ fs_inst::components_read(unsigned i) const
>>  
>> case SHADER_OPCODE_UNTYPED_SURFACE_READ_LOGICAL:
>> case SHADER_OPCODE_TYPED_SURFACE_READ_LOGICAL:
>> +   case SHADER_OPCODE_BYTE_SCATTERED_READ_LOGICAL:
>>assert(src[3].file == 

Re: [Mesa-dev] [PATCH v3 19/43] i965/fs: Support push constants of 16-bit types

2017-10-14 Thread Chema Casanova


On 14/10/17 10:02, Pohjolainen, Topi wrote:
> On Thu, Oct 12, 2017 at 08:38:08PM +0200, Jose Maria Casanova Crespo wrote:
>> We enable the use of 16-bit values in push constants
>> modifying the assign_constant_locations function to work
>> with 16-bit types.
>>
>> The API to access buffers in Vulkan use multiples of 4-byte for
>> offsets and sizes. Current accountability of uniforms based on 4-byte
>> slots will work for 16-bit values if they are allowed to use 32-bit
>> slots. For that, we replace the division by 4 by a DIV_ROUND_UP, so
>> 2-byte elements will use 1 slot instead of 0.
>>
>> We aligns the 16-bit locations after assigning the 32-bit
> 
> s/aligns/align/
> 
Also fixed.
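
For reference, the slot accounting described in the commit message can be
sketched as follows. This is a standalone example with made-up toy_*
names, not the code of assign_constant_locations.

```c
/* Round-up division, same idea as Mesa's DIV_ROUND_UP macro. */
#define TOY_DIV_ROUND_UP(a, b) (((a) + (b) - 1) / (b))

/* Old accounting: a 2-byte element ends up using 0 slots. */
static unsigned
toy_slots_truncating(unsigned size_bytes)
{
   return size_bytes / 4;
}

/* New accounting: any element smaller than 4 bytes still takes 1 slot. */
static unsigned
toy_slots_rounded(unsigned size_bytes)
{
   return TOY_DIV_ROUND_UP(size_bytes, 4);
}
```

Sizes that are already multiples of 4 bytes give the same result with
both functions, so 32-bit and 64-bit uniforms are not affected.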

Thanks.

Chema


Re: [Mesa-dev] [PATCH v3 14/43] i965/fs: Handle 32-bit to 16-bit conversions

2017-10-14 Thread Chema Casanova


On 14/10/17 09:55, Pohjolainen, Topi wrote:
> On Thu, Oct 12, 2017 at 08:38:03PM +0200, Jose Maria Casanova Crespo wrote:
>> From: Alejandro Piñeiro 
>>
>> Conversions to 16-bit need having aligment between the 16-bit
>> and 32-bit types. So the conversion operations unpack 16-bit types
>> to with an stride=2 and then applies a MOV with the conversion.
>>
>> v2 (Jason Ekstrand):
>>   - Avoid the general use of stride=2 for 16-bit register types.
>>
>> Signed-off-by: Eduardo Lima 
>> Signed-off-by: Alejandro Piñeiro 
>> Signed-off-by: Jose Maria Casanova Crespo 
>> ---
>>  src/intel/compiler/brw_fs_nir.cpp | 25 +
>>  1 file changed, 25 insertions(+)
>>
>> diff --git a/src/intel/compiler/brw_fs_nir.cpp 
>> b/src/intel/compiler/brw_fs_nir.cpp
>> index affe65d5e9..6908c7ea02 100644
>> --- a/src/intel/compiler/brw_fs_nir.cpp
>> +++ b/src/intel/compiler/brw_fs_nir.cpp
>> @@ -693,6 +693,31 @@ fs_visitor::nir_emit_alu(const fs_builder , 
>> nir_alu_instr *instr)
>>inst->saturate = instr->dest.saturate;
>>break;
>>  
>> +  /* In theory, it would be better to use BRW_OPCODE_F32TO16. Depending
>> +   * on the HW gen, it is a special hw opcode or just a MOV, and
>> +   * brw_F32TO16 (at brw_eu_emit) would do the work to chose.
>> +   *
>> +   * But if we want to use that opcode, we need to provide support on
>> +   * different optimizations and lowerings. As right now HF support is
>> +   * only for gen8+, it will be better to use directly the MOV, and use
>> +   * BRW_OPCODE_F32TO16 when/if we work for HF support on gen7.
>> +   */
>> +
>> +   case nir_op_f2f16:
>> +   case nir_op_i2i16:
>> +   case nir_op_u2u16: {
>> +  /* TODO: Fixing aligment rules for conversions from 32-bits to
>> +   * 16-bit types should be moved to lower_conversions
>> +   */
>> +  fs_reg tmp = bld.vgrf(op[0].type, 1);
>> +  tmp = subscript(tmp, result.type, 0);
>> +  inst = bld.MOV(tmp, op[0]);
>> +  inst->saturate = instr->dest.saturate;
>> +  inst = bld.MOV(result ,tmp);
> 
> Move space after ','
> 

Fixed locally.

Thanks.
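
In case it helps to follow the two MOVs, this is a small CPU-side model
of the data movement for the integer case (u2u16). It is only a sketch
under my own assumptions: an 8-channel width, made-up toy_* names, and
plain truncation instead of what the EU really does for f2f16.

```c
#include <stdint.h>

#define TOY_WIDTH 8 /* assumed SIMD8 */

/*
 * Model of a 32-bit to 16-bit conversion done in two steps:
 *  1. a converting MOV whose 16-bit destination uses stride 2, so each
 *     result sits in the low half of the 32-bit slot of its channel;
 *  2. a plain MOV that packs the strided values into the result.
 */
static void
toy_u2u16(const uint32_t src[TOY_WIDTH], uint16_t dst[TOY_WIDTH])
{
   /* tmp is as large as the 32-bit source, viewed as 16-bit elements. */
   uint16_t tmp[2 * TOY_WIDTH] = { 0 };

   /* First MOV: convert, writing every other 16-bit element. */
   for (unsigned ch = 0; ch < TOY_WIDTH; ch++)
      tmp[2 * ch] = (uint16_t)src[ch];

   /* Second MOV: read with stride 2, write packed. */
   for (unsigned ch = 0; ch < TOY_WIDTH; ch++)
      dst[ch] = tmp[2 * ch];
}
```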

>> +  inst->saturate = instr->dest.saturate;
>> +  break;
>> +   }
>> +
>> case nir_op_f2f64:
>> case nir_op_i2f64:
>> case nir_op_u2f64:
>> -- 
>> 2.13.6
>>
>> ___
>> mesa-dev mailing list
>> mesa-dev@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> 


Re: [Mesa-dev] [PATCH v3 16/43] i965/fs: Define new shader opcode to set rounding modes

2017-10-14 Thread Chema Casanova


On 14/10/17 09:49, Pohjolainen, Topi wrote:
> On Thu, Oct 12, 2017 at 08:38:05PM +0200, Jose Maria Casanova Crespo wrote:
>> From: Alejandro Piñeiro 
>>
>> Although it is possible to emit them directly as AND/OR on brw_fs_nir,
>> having a specific opcode makes it easier to remove duplicate settings
>> later.
>>
>> v2: (Curro)
>>   - Set thread control to 'switch' when using the control register
>>   - Use a single SHADER_OPCODE_RND_MODE opcode taking an immediate
>> with the rounding mode.
>>   - Avoid magic numbers setting rounding mode field at control register.
>> v3: (Curro)
>>   - Remove redundant and add missing whitespace lines.
>>   - Match printing instruction to IR opcode "rnd_mode"
>>
>> Signed-off-by:  Alejandro Piñeiro 
>> Signed-off-by:  Jose Maria Casanova Crespo 
>> Reviewed-by: Francisco Jerez 
>> ---
>>  src/intel/compiler/brw_eu.h |  4 
>>  src/intel/compiler/brw_eu_defines.h | 16 
>>  src/intel/compiler/brw_eu_emit.c| 33 
>> +
>>  src/intel/compiler/brw_fs_generator.cpp |  5 +
>>  src/intel/compiler/brw_shader.cpp   |  4 
>>  5 files changed, 62 insertions(+)
>>
>> diff --git a/src/intel/compiler/brw_eu.h b/src/intel/compiler/brw_eu.h
>> index 8e597b212a..145942a54f 100644
>> --- a/src/intel/compiler/brw_eu.h
>> +++ b/src/intel/compiler/brw_eu.h
>> @@ -500,6 +500,10 @@ brw_broadcast(struct brw_codegen *p,
>>struct brw_reg src,
>>struct brw_reg idx);
>>  
>> +void
>> +brw_rounding_mode(struct brw_codegen *p,
>> +  enum brw_rnd_mode mode);
>> +
>>  /***
>>   * brw_eu_util.c:
>>   */
>> diff --git a/src/intel/compiler/brw_eu_defines.h 
>> b/src/intel/compiler/brw_eu_defines.h
>> index da482b73c5..6687883bfb 100644
>> --- a/src/intel/compiler/brw_eu_defines.h
>> +++ b/src/intel/compiler/brw_eu_defines.h
>> @@ -388,6 +388,8 @@ enum opcode {
>> SHADER_OPCODE_TYPED_SURFACE_WRITE,
>> SHADER_OPCODE_TYPED_SURFACE_WRITE_LOGICAL,
>>  
>> +   SHADER_OPCODE_RND_MODE,
>> +
>> SHADER_OPCODE_MEMORY_FENCE,
>>  
>> SHADER_OPCODE_GEN4_SCRATCH_READ,
>> @@ -1214,4 +1216,18 @@ enum brw_message_target {
>>  /* R0 */
>>  # define GEN7_GS_PAYLOAD_INSTANCE_ID_SHIFT  27
>>  
>> +/* CR0.0[5:4] Floating-Point Rounding Modes
>> + *  Skylake PRM, Volume 7 Part 1, "Control Register", page 756
>> + */
>> +
>> +#define BRW_CR0_RND_MODE_MASK 0x30
>> +#define BRW_CR0_RND_MODE_SHIFT4
>> +
>> +enum PACKED brw_rnd_mode {
>> +   BRW_RND_MODE_RTNE = 0,  /* Round to Nearest or Even */
>> +   BRW_RND_MODE_RU = 1,/* Round Up, toward +inf */
>> +   BRW_RND_MODE_RD = 2,/* Round Down, toward -inf */
>> +   BRW_RND_MODE_RTZ = 3,   /* Round Toward Zero */
>> +};
>> +
>>  #endif /* BRW_EU_DEFINES_H */
>> diff --git a/src/intel/compiler/brw_eu_emit.c 
>> b/src/intel/compiler/brw_eu_emit.c
>> index 2b38d959d1..8c1e4c5eae 100644
>> --- a/src/intel/compiler/brw_eu_emit.c
>> +++ b/src/intel/compiler/brw_eu_emit.c
>> @@ -3450,3 +3450,36 @@ brw_WAIT(struct brw_codegen *p)
>> brw_inst_set_exec_size(devinfo, insn, BRW_EXECUTE_1);
>> brw_inst_set_mask_control(devinfo, insn, BRW_MASK_DISABLE);
>>  }
>> +
>> +/**
>> + * Changes the floating point rounding mode updating the control register
>> + * field defined at cr0.0[5-6] bits. This function supports the changes to
>> + * RTNE (00), RU (01), RD (10) and RTZ (11) rounding using bitwise 
>> operations.
>> + * Only RTNE and RTZ rounding are enabled at nir.
>> + */
>> +void
>> +brw_rounding_mode(struct brw_codegen *p,
>> +  enum brw_rnd_mode mode)
>> +{
>> +   const unsigned bits  = mode << BRW_CR0_RND_MODE_SHIFT;
> 
> Extra space before '='.
> 

Fixed locally

Thanks.
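
For anyone following the thread, the AND/OR update of the rounding mode
field can be modelled on the CPU like this. It is a sketch only: the
toy_* names are made up, the mask and shift values are the ones from the
patch, and cr0 is just a plain integer here instead of the control
register.

```c
#include <stdint.h>

#define TOY_CR0_RND_MODE_MASK  0x30u
#define TOY_CR0_RND_MODE_SHIFT 4

enum toy_rnd_mode {
   TOY_RND_MODE_RTNE = 0, /* round to nearest even */
   TOY_RND_MODE_RU   = 1, /* round up, toward +inf */
   TOY_RND_MODE_RD   = 2, /* round down, toward -inf */
   TOY_RND_MODE_RTZ  = 3, /* round toward zero */
};

/* Return the value cr0 would hold after switching to the given mode. */
static uint32_t
toy_set_rounding_mode(uint32_t cr0, enum toy_rnd_mode mode)
{
   const uint32_t bits = (uint32_t)mode << TOY_CR0_RND_MODE_SHIFT;

   /* AND: clear the field, unless both bits are going to be set anyway. */
   if (bits != TOY_CR0_RND_MODE_MASK)
      cr0 &= ~TOY_CR0_RND_MODE_MASK;

   /* OR: set the requested bits; nothing to do for RTNE (00). */
   if (bits)
      cr0 |= bits;

   return cr0;
}
```

The two conditions mean that RTNE needs only the AND, RTZ needs only the
OR, and RU/RD need both, while the rest of the register is left
untouched.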


Re: [Mesa-dev] [PATCH] i965/fs: Extend the live ranges of VGRFs which leave loops

2017-10-10 Thread Chema Casanova
With this patch applied I can no longer reproduce the regression
related to cross-channel variable interference in non-uniformly executed
loops that was exposed by
dEQP-VK.glsl.return.return_in_dynamic_loop_dynamic_vertex when applying
Curro's liveness patch.

Tested-by: Jose Maria Casanova Crespo 

On 05/10/17 20:52, Jason Ekstrand wrote:
> No Shader-db changes.
> 
> Cc: mesa-sta...@lists.freedesktop.org
> ---
>  src/intel/compiler/brw_fs_live_variables.cpp | 55 
> 
>  1 file changed, 55 insertions(+)
> 
> diff --git a/src/intel/compiler/brw_fs_live_variables.cpp 
> b/src/intel/compiler/brw_fs_live_variables.cpp
> index c449672..380060d 100644
> --- a/src/intel/compiler/brw_fs_live_variables.cpp
> +++ b/src/intel/compiler/brw_fs_live_variables.cpp
> @@ -223,6 +223,61 @@ fs_live_variables::compute_start_end()
>   }
>}
> }
> +
> +   /* Due to the explicit way the SIMD data is handled on GEN, we need to be 
> a
> +* bit more careful with live ranges and loops.  Consider the following
> +* example:
> +*
> +*vec4 color2;
> +*while (1) {
> +*   vec4 color = texture();
> +*   if (...) {
> +*  color2 = color * 2;
> +*  break;
> +*   }
> +*}
> +*gl_FragColor = color2;
> +*
> +* In this case, the definition of color2 dominates the use because the
> +* loop only has the one exit.  This means that the live range interval 
> for
> +* color2 goes from the statement in the if to it's use below the loop.
> +* Now suppose that the texture operation has a header register that gets
> +* assigned one of the registers used for color2.  If the loop condition 
> is
> +* non-uniform and some of the threads will take the and others will
> +* continue.  In this case, the next pass through the loop, the WE_all
> +* setup of the header register will stomp the disabled channels of color2
> +* and corrupt the value.
> +*
> +* This same problem can occur if you have a mix of 64, 32, and 16-bit
> +* registers because the channels do not line up or if you have a SIMD16
> +* program and the first half of one value overlaps the second half of the
> +* other.
> +*
> +* To solve this problem, we take any VGRFs whose live ranges cross the
> +* while instruction of a loop and extend their live ranges to the top of
> +* the loop.  This more accurately models the hardware because the value 
> in
> +* the VGRF needs to be carried through subsequent loop iterations in 
> order
> +* to remain valid when we finally do break.
> +*/
> +   foreach_block (block, cfg) {
> +  if (block->end()->opcode != BRW_OPCODE_WHILE)
> + continue;
> +
> +  /* This is a WHILE instrution. Find the DO block. */
> +  bblock_t *do_block = NULL;
> +  foreach_list_typed(bblock_link, child_link, link, >children) {
> + if (child_link->block->start_ip < block->end_ip) {
> +assert(do_block == NULL);
> +do_block = child_link->block;
> + }
> +  }
> +  assert(do_block);
> +
> +  for (int i = 0; i < num_vars; i++) {
> + if (start[i] < block->end_ip && end[i] > block->end_ip)
> +start[i] = MIN2(start[i], do_block->start_ip);
> +  }
> +   }
>  }
>  
>  fs_live_variables::fs_live_variables(fs_visitor *v, const cfg_t *cfg)
> 
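
To summarize the final loop of the patch, here is a standalone sketch of
the live-range extension for a single loop. The toy_* names and the flat
argument list are my own simplification; the real code walks the CFG
blocks and finds the DO block from the WHILE block's children.

```c
#define TOY_MIN2(a, b) ((a) < (b) ? (a) : (b))

/*
 * start[i]/end[i] are the instruction IPs bounding the live range of
 * variable i. Any range that crosses the WHILE instruction gets its
 * start pulled up to the top of the loop (the DO block).
 */
static void
toy_extend_live_ranges(int *start, const int *end, int num_vars,
                       int do_start_ip, int while_end_ip)
{
   for (int i = 0; i < num_vars; i++) {
      if (start[i] < while_end_ip && end[i] > while_end_ip)
         start[i] = TOY_MIN2(start[i], do_start_ip);
   }
}
```

Only values defined before the WHILE and used after it are affected, so
variables that live entirely inside or entirely outside the loop keep
their original intervals.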


Re: [Mesa-dev] [PATCH] intel/fs: Restrict live intervals to the subset possibly reachable from any definition.

2017-09-08 Thread Chema Casanova
On 08/09/17 at 11:06, Alejandro Piñeiro wrote:
> On 08/09/17 02:50, Francisco Jerez wrote:
>> Currently the liveness analysis pass would extend a live interval up
>> to the top of the program when no unconditional and complete
>> definition of the variable is found that dominates all of its uses.
>>
>> This can lead to a serious performance problem in shaders containing
>> many partial writes, like scalar arithmetic, FP64 and soon FP16
>> operations.  
> Just tested with the VK_KHR_16bit_storage implementation. Works really
> fine with the most problematic tests, so we can drop the "i965/fs: Add
> reuse_16bit_conversions_register optimization" patch (that was already
> NAKed by both you and Connor).
>
> My test was limited to that extension CTS tests, but in case it helps:
> Tested-by: Alejandro Piñeiro 

I've seen one regression on a full VK-CTS run with this patch over the
VK_KHR_16bit_storage branch, but I can reproduce the same regression
applying it on master.

Failing test is:
dEQP-VK.glsl.return.return_in_dynamic_loop_dynamic_vertex

>> The number of oversize live intervals in such workloads
>> can cause the compilation time of the shader to explode because of the
>> worse than quadratic behavior of the register allocator and scheduler
>> when running out of registers, and it can also cause the running time
>> of the shader to explode due to the amount of spilling it leads to,
>> which is orders of magnitude slower than GRF memory.
>>
>> This patch fixes it by computing the intersection of our current live
>> intervals with the subset of the program that can possibly be reached
>> from any definition of the variable.  Extending the storage allocation
>> of the variable beyond that is pretty useless because its value is
>> guaranteed to be undefined at a point that cannot be reached from any
>> definition.
>>
>> No significant change in the running time of shader-db (with 5%
>> statistical significance).
>>
>> shader-db results on IVB:
>>
>>   total cycles in shared programs: 61108780 -> 60932856 (-0.29%)
>>   cycles in affected programs: 16335482 -> 16159558 (-1.08%)
>>   helped: 5121
>>   HURT: 4347
>>
>>   total spills in shared programs: 1309 -> 1288 (-1.60%)
>>   spills in affected programs: 249 -> 228 (-8.43%)
>>   helped: 3
>>   HURT: 0
>>
>>   total fills in shared programs: 1652 -> 1597 (-3.33%)
>>   fills in affected programs: 262 -> 207 (-20.99%)
>>   helped: 4
>>   HURT: 0
>>
>>   LOST:   2
>>   GAINED: 209
>>
>> shader-db results on BDW:
>>
>>   total cycles in shared programs: 67617262 -> 67361220 (-0.38%)
>>   cycles in affected programs: 23397142 -> 23141100 (-1.09%)
>>   helped: 8045
>>   HURT: 6488
>>
>>   total spills in shared programs: 1456 -> 1252 (-14.01%)
>>   spills in affected programs: 465 -> 261 (-43.87%)
>>   helped: 3
>>   HURT: 0
>>
>>   total fills in shared programs: 1720 -> 1465 (-14.83%)
>>   fills in affected programs: 471 -> 216 (-54.14%)
>>   helped: 4
>>   HURT: 0
>>
>>   LOST:   2
>>   GAINED: 162
>>
>> shader-db results on SKL:
>>
>>   total cycles in shared programs: 65436248 -> 65245186 (-0.29%)
>>   cycles in affected programs: 22560936 -> 22369874 (-0.85%)
>>   helped: 8457
>>   HURT: 6247
>>
>>   total spills in shared programs: 437 -> 437 (0.00%)
>>   spills in affected programs: 0 -> 0
>>   helped: 0
>>   HURT: 0
>>
>>   total fills in shared programs: 870 -> 854 (-1.84%)
>>   fills in affected programs: 16 -> 0
>>   helped: 1
>>   HURT: 0
>>
>>   LOST:   0
>>   GAINED: 107
>> ---
>>  src/intel/compiler/brw_fs_live_variables.cpp | 34 
>> 
>>  src/intel/compiler/brw_fs_live_variables.h   | 12 ++
>>  2 files changed, 42 insertions(+), 4 deletions(-)
>>
>> diff --git a/src/intel/compiler/brw_fs_live_variables.cpp 
>> b/src/intel/compiler/brw_fs_live_variables.cpp
>> index c449672a519..059f076fa51 100644
>> --- a/src/intel/compiler/brw_fs_live_variables.cpp
>> +++ b/src/intel/compiler/brw_fs_live_variables.cpp
>> @@ -83,9 +83,11 @@ fs_live_variables::setup_one_write(struct block_data *bd, 
>> fs_inst *inst,
>> /* The def[] bitset marks when an initialization in a block completely
>>  * screens off previous updates of that variable (VGRF channel).
>>  */
>> -   if (inst->dst.file == VGRF && !inst->is_partial_write()) {
>> -  if (!BITSET_TEST(bd->use, var))
>> +   if (inst->dst.file == VGRF) {
>> +  if (!inst->is_partial_write() && !BITSET_TEST(bd->use, var))
>>   BITSET_SET(bd->def, var);
>> +
>> +  BITSET_SET(bd->defout, var);
>> }
>>  }
>>  
>> @@ -199,6 +201,28 @@ fs_live_variables::compute_live_variables()
>>   }
>>}
>> }
>> +
>> +   /* Propagate defin and defout down the CFG to calculate the union of live
>> +* variables potentially defined along any possible control flow path.
>> +*/
>> +   do {
>> +  cont = false;
>> +
>> +  foreach_block (block, cfg) {
>> + const struct block_data *bd = _data[block->num];
>> 

Re: [Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

2017-09-06 Thread Chema Casanova
Hi Connor and Curro,

On 28/08/17 12:24, Alejandro Piñeiro wrote:
> On 27/08/17 20:24, Connor Abbott wrote:
>> Hi,
>>
>> On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" > > wrote:
>>
>> On 24/08/17 21:07, Connor Abbott wrote:
>> >
>> > Hi Alejandro,
>>
>> Hi Connor,
>>
>> >
>> > This seems really suspicious. If the live ranges are really
>> > independent, then the register allocator should be able to
>> assign the
>> > two virtual registers to the same physical register if it needs to.
>>
>> Yes, it is true, the register allocator should be able to assign two
>> virtual registers to the same physical register. But that is done
>> at the
>> end (or really near the end), so late for the problem this
>> optimization
>> is trying to fix.
>>
>>
>> Well, my understanding is that the problem is long compilation times
>> due to spilling and our not-so-great implementation of it. So no,
>> register allocation is not late for the problem. As both Curro and I
>> explained, the change by itself can only pessimise register
>> allocation, so if it helps then it must be due to a bug in the
>> register allocator or a problem in a subsequent pass that's getting
>> hidden by this one.
> 
> Ok.
> 
>>
>> We are also reducing the amount of instructions used.
>>
>>
>> The comments in the source code say otherwise. Any instructions
>> eliminated were from spilling, which this pass only accidentally reduces.
> 
> Yes, sorry, I explained myself poorly. The optimization itself doesn't
> remove any instructions. But using it reduces the final number of
> instructions, although as you say, they are likely due reducing the
> spilling.
> 
>>
>>
>>
>> Probably not really clear on the commit message. When I say
>> "reduce the
>> pressure of the register allocator" I mean having a code that the
>> register allocator would be able to handle without using too much
>> time.
>> The problem this optimization tries to solve is that for some 16
>> bit CTS
>> tests (some with matrices and geometry shaders), the amount of virtual
>> registers used and instructions was really big. For the record,
>> initially, some tests needed 24 min just to compile. Right now, thanks
>> to other optimizations, the slower test without this optimization
>> needs
>> 1min 30 seconds. Adding some hacky timestamps, the time used  at
>> fs_visitor::allocate_registers (brw_fs.cpp:6096) is:
>>
>> * While trying to schedule using the three available pre mode
>> heuristics: 7 seconds
>> * Allocation with spilling: 63 seconds
>> * Final schedule using SCHEDULE_POST: 19 seconds
>>
>> With this optimization, the total time goes down to 14 seconds (10
>> + 0 +
>> 3 on the previous bullet point list).
>>
>> One could argue that 1min 30 seconds is okish. But taking into account
>> that it goes down to 14 seconds, even with some caveats (see below), I
>> still think that it is worth to use the optimization.
>>
>> And a final comment. For that same test, this is the final stats
>> (using
>> INTEL_DEBUG):
>>
>>  * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
>> 130320 cycles. 15:9 spills:fills.
>>  * Without the optimization: SIMD8 shader: 12312 instructions. 0
>> loops.
>> 174816 cycles. 751:1851 spills:fills.
>>
>>
>> So, the fact that it helps at all with SIMD8 shows that my theory is
>> wrong, but since your pass reduces spilling, it clearly must be
>> avoiding a bug somewhere else. You need to compare the IR for a shader
>> with the problem with and without this pass right before register
>> allocation. Maybe the sources and destinations of the conversion
>> instructions interfere without the change due to some other pass
>> that's increasing register pressure, in which case that's the problem,
>> but I doubt it.
> 
> Ok, thanks for the hints.

After some research we found that we need to adapt the live_variables
algorithm to support 32-bit to 16-bit conversions. Because of the HW
alignment restrictions, the destination register of these conversions
must use stride=2, so it is not contiguous (stride != 1) and, by
definition, is_partial_write returns true. Any of the last 3 conditions
below can be true when we use 16-bit types.

bool
fs_inst::is_partial_write() const
{
   return ((this->predicate && this->opcode != BRW_OPCODE_SEL) ||
   (this->exec_size * type_sz(this->dst.type)) < 32 ||
   !this->dst.is_contiguous() ||
   this->dst.offset % REG_SIZE != 0);
}
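
To make that concrete, here is a simplified standalone model of those
conditions. It is only a sketch: the struct and the toy_* names are made
up, "contiguous" is assumed to mean stride == 1, and the register size is
assumed to be 32 bytes.

```c
#include <stdbool.h>

#define TOY_REG_SIZE 32 /* assumed GRF size in bytes */

struct toy_inst {
   bool predicated;        /* predicate set and opcode is not SEL */
   unsigned exec_size;     /* number of channels */
   unsigned dst_type_size; /* bytes per destination element */
   unsigned dst_stride;    /* in elements */
   unsigned dst_offset;    /* in bytes */
};

/* Same four conditions as the function above, on the toy struct. */
static bool
toy_is_partial_write(const struct toy_inst *inst)
{
   return inst->predicated ||
          inst->exec_size * inst->dst_type_size < TOY_REG_SIZE ||
          inst->dst_stride != 1 ||
          inst->dst_offset % TOY_REG_SIZE != 0;
}
```

In this model a SIMD8 write of a packed 16-bit type covers only 16 bytes,
and a SIMD16 conversion with a stride-2 destination is not contiguous, so
both count as partial writes even though they define every channel.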

So the check in the setup_one_write function in
brw_fs_live_variables.cpp does not mark the variable as completely
defined in the block.

   if (inst->dst.file == VGRF && !inst->is_partial_write()) {
  if (!BITSET_TEST(bd->use, var))
 BITSET_SET(bd->def, var);
   }

That makes that the live start of the variable is expected to defined

Re: [Mesa-dev] [PATCH v2] i965/fs: Define new shader opcode to set rounding modes

2017-09-06 Thread Chema Casanova

On 05/09/17 23:41, Francisco Jerez wrote:
> Alejandro Piñeiro  writes:
> 
>> Although it is possible to emit them directly as AND/OR on brw_fs_nir,
>> having a specific opcode makes it easier to remove duplicate settings
>> later.
>>
>> v2: (Curro)
>>   - Set thread control to 'switch' when using the control register
>>   - Use a single SHADER_OPCODE_RND_MODE opcode taking an immediate
>> with the rounding mode.
>>   - Avoid magic numbers setting rounding mode field at control register.
>>
>> Signed-off-by:  Alejandro Piñeiro 
>> Signed-off-by:  Jose Maria Casanova Crespo 
>> ---
>>  src/intel/compiler/brw_eu.h |  3 +++
>>  src/intel/compiler/brw_eu_defines.h | 17 +
>>  src/intel/compiler/brw_eu_emit.c| 34 
>> +
>>  src/intel/compiler/brw_fs_generator.cpp |  5 +
>>  src/intel/compiler/brw_shader.cpp   |  4 
>>  5 files changed, 63 insertions(+)
>>
>> diff --git a/src/intel/compiler/brw_eu.h b/src/intel/compiler/brw_eu.h
>> index 8e597b212a6..106bf03530d 100644
>> --- a/src/intel/compiler/brw_eu.h
>> +++ b/src/intel/compiler/brw_eu.h
>> @@ -500,6 +500,9 @@ brw_broadcast(struct brw_codegen *p,
>>struct brw_reg src,
>>struct brw_reg idx);
>>  
>> +void
>> +brw_rounding_mode(struct brw_codegen *p,
>> +  enum brw_rnd_mode mode);
> 
> Missing whitespace line.

Ok

> 
>>  /***
>>   * brw_eu_util.c:
>>   */
>> diff --git a/src/intel/compiler/brw_eu_defines.h 
>> b/src/intel/compiler/brw_eu_defines.h
>> index da482b73c58..91d88fe8952 100644
>> --- a/src/intel/compiler/brw_eu_defines.h
>> +++ b/src/intel/compiler/brw_eu_defines.h
>> @@ -388,6 +388,9 @@ enum opcode {
>> SHADER_OPCODE_TYPED_SURFACE_WRITE,
>> SHADER_OPCODE_TYPED_SURFACE_WRITE_LOGICAL,
>>  
>> +
> 
> Redundant whitespace.

OK.

> 
>> +   SHADER_OPCODE_RND_MODE,
>> +
>> SHADER_OPCODE_MEMORY_FENCE,
>>  
>> SHADER_OPCODE_GEN4_SCRATCH_READ,
>> @@ -1214,4 +1217,18 @@ enum brw_message_target {
>>  /* R0 */
>>  # define GEN7_GS_PAYLOAD_INSTANCE_ID_SHIFT  27
>>  
>> +/* CR0.0[5:4] Floating-Point Rounding Modes
>> + *  Skylake PRM, Volume 7 Part 1, "Control Register", page 756
>> + */
>> +
>> +#define BRW_CR0_RND_MODE_MASK 0x30
>> +#define BRW_CR0_RND_MODE_SHIFT4
>> +
>> +enum PACKED brw_rnd_mode {
>> +   BRW_RND_MODE_RTNE = 0,  /* Round to Nearest or Even */
>> +   BRW_RND_MODE_RU = 1,/* Round Up, toward +inf */
>> +   BRW_RND_MODE_RD = 2,/* Round Down, toward -inf */
>> +   BRW_RND_MODE_RTZ = 3/* Round Toward Zero */
>> +};
>> +
>>  #endif /* BRW_EU_DEFINES_H */
>> diff --git a/src/intel/compiler/brw_eu_emit.c 
>> b/src/intel/compiler/brw_eu_emit.c
>> index 8c952e7da26..12164653e47 100644
>> --- a/src/intel/compiler/brw_eu_emit.c
>> +++ b/src/intel/compiler/brw_eu_emit.c
>> @@ -3530,3 +3530,37 @@ brw_WAIT(struct brw_codegen *p)
>> brw_inst_set_exec_size(devinfo, insn, BRW_EXECUTE_1);
>> brw_inst_set_mask_control(devinfo, insn, BRW_MASK_DISABLE);
>>  }
>> +
>> +/**
>> + * Changes the floating point rounding mode updating the control register
>> + * field defined at cr0.0[5-6] bits. This function supports the changes to
>> + * RTNE (00), RU (01), RD (10) and RTZ (11) rounding using bitwise 
>> operations.
>> + * Only RTNE and RTZ rounding are enabled at nir.
>> + */
>> +
> 
> Redundant whitespace.

OK.

> 
>> +void
>> +brw_rounding_mode(struct brw_codegen *p,
>> +  enum brw_rnd_mode mode)
>> +{
>> +   const unsigned bits  = mode << BRW_CR0_RND_MODE_SHIFT;
>> +
>> +   if (bits != BRW_CR0_RND_MODE_MASK) {
>> +  brw_inst *inst = brw_AND(p, brw_cr0_reg(0), brw_cr0_reg(0),
>> +   brw_imm_ud(~BRW_CR0_RND_MODE_MASK));
>> +
>> +  /* From the Skylake PRM, Volume 7, page 760:
>> +   *  "Implementation Restriction on Register Access: When the control
>> +   *   register is used as an explicit source and/or destination, 
>> hardware
>> +   *   does not ensure execution pipeline coherency. Software must set 
>> the
>> +   *   thread control field to ‘switch’ for an instruction that uses
>> +   *   control register as an explicit operand."
>> +   */
>> +  brw_inst_set_thread_control(p->devinfo, inst, BRW_THREAD_SWITCH);
>> +}
>> +
>> +   if (bits) {
>> +  brw_inst *inst = brw_OR(p, brw_cr0_reg(0), brw_cr0_reg(0),
>> +  brw_imm_ud(bits));
>> +  brw_inst_set_thread_control(p->devinfo, inst, BRW_THREAD_SWITCH);
>> +   }
>> +}
>> diff --git a/src/intel/compiler/brw_fs_generator.cpp 
>> b/src/intel/compiler/brw_fs_generator.cpp
>> index afaec5c9497..ff9880ebfe8 100644
>> --- a/src/intel/compiler/brw_fs_generator.cpp
>> +++ b/src/intel/compiler/brw_fs_generator.cpp
>> @@ -2144,6 +2144,11 @@ fs_generator::generate_code(const cfg_t *cfg, int 
>> 

Re: [Mesa-dev] [PATCH 20/47] i965/fs: Define new shader opcodes to set rounding modes

2017-08-29 Thread Chema Casanova


On 29/08/17 21:18, Francisco Jerez wrote:
> Chema Casanova <jmcasan...@igalia.com> writes:
> 
> >> On 25/08/17 at 20:09, Francisco Jerez wrote:
>>> Alejandro Piñeiro <apinhe...@igalia.com> writes:
>>>
>>>> Although it is possible to emit them directly as AND/OR on brw_fs_nir,
>>>> having specific opcodes makes it easier to remove duplicate settings
>>>> later.
>>>>
>>>> Signed-off-by:  Alejandro Piñeiro <apinhe...@igalia.com>
>>>> Signed-off-by:  Jose Maria Casanova Crespo <jmcasan...@igalia.com>
>>>> ---
>>>>  src/intel/compiler/brw_eu.h |  3 +++
>>>>  src/intel/compiler/brw_eu_defines.h |  9 +
>>>>  src/intel/compiler/brw_eu_emit.c| 19 +++
>>>>  src/intel/compiler/brw_fs_generator.cpp |  8 
>>>>  src/intel/compiler/brw_shader.cpp   |  5 +
>>>>  5 files changed, 44 insertions(+)
>>>>
>>>> diff --git a/src/intel/compiler/brw_eu.h b/src/intel/compiler/brw_eu.h
>>>> index a3a9c63239d..0a7f8020398 100644
>>>> --- a/src/intel/compiler/brw_eu.h
>>>> +++ b/src/intel/compiler/brw_eu.h
>>>> @@ -500,6 +500,9 @@ brw_broadcast(struct brw_codegen *p,
>>>>struct brw_reg src,
>>>>struct brw_reg idx);
>>>>  
>>>> +void
>>>> +brw_rounding_mode(struct brw_codegen *p,
>>>> +  enum brw_rnd_mode mode);
>>>>  /***
>>>>   * brw_eu_util.c:
>>>>   */
>>>> diff --git a/src/intel/compiler/brw_eu_defines.h 
>>>> b/src/intel/compiler/brw_eu_defines.h
>>>> index 1af835d47ed..50435df2fcf 100644
>>>> --- a/src/intel/compiler/brw_eu_defines.h
>>>> +++ b/src/intel/compiler/brw_eu_defines.h
>>>> @@ -388,6 +388,9 @@ enum opcode {
>>>> SHADER_OPCODE_TYPED_SURFACE_WRITE,
>>>> SHADER_OPCODE_TYPED_SURFACE_WRITE_LOGICAL,
>>>>  
>>>> +   SHADER_OPCODE_RND_MODE_RTE,
>>>> +   SHADER_OPCODE_RND_MODE_RTZ,
>>>> +
>>> We don't need an opcode for each possible rounding mode (there's also RU
>>> and RD).  How about you add a single SHADER_OPCODE_RND_MODE opcode
>>> taking an immediate with the right rounding mode?
>> I like the proposal. It is better to have a single opcode for setting the
>> rounding mode. Changed for v2 of this patch.
>>> Also, you should be marking the rounding mode opcodes as
>>> has_side_effects(), because otherwise you're giving the scheduler the
>>> freedom of moving your rounding mode update instruction past the
>>> instruction you wanted it to have an effect on...
>> Good point. We had already noticed that it was missing while debugging
>> the latency problem of the control register: the scheduler was moving the
>> cr0 modification to the beginning of the shader, which hid the problem.
>>>> SHADER_OPCODE_MEMORY_FENCE,
>>>>  
>>>> SHADER_OPCODE_GEN4_SCRATCH_READ,
>>>> @@ -1233,4 +1236,10 @@ enum brw_message_target {
>>>>  /* R0 */
>>>>  # define GEN7_GS_PAYLOAD_INSTANCE_ID_SHIFT27
>>>>  
>>>> +enum PACKED brw_rnd_mode {
>>>> +   BRW_RND_MODE_UNSPECIFIED,
>>>> +   BRW_RND_MODE_RTE,
>>>> +   BRW_RND_MODE_RTZ,
>>> Since you're introducing a back-end-specific rounding mode enum already,
>>> why not use the hardware values right away so you avoid hard-coding
>>> magic constants below.
>> In the end we removed MODE_UNSPECIFIED, as it isn't really needed, and we
>> included Round Up and Round Down in the enum, assigning the values from
>> the PRM. We also renamed RTE to RTNE for consistency with the nir
>> conversion modifiers, which use the same acronym as the PRM.
>>
>> +enum PACKED brw_rnd_mode {
>> +   BRW_RND_MODE_RTNE = 0,  /* Round to Nearest or Even */
>> +   BRW_RND_MODE_RU = 1,/* Round Up, toward +inf */
>> +   BRW_RND_MODE_RD = 2,/* Round Down, toward -inf */
>> +   BRW_RND_MODE_RTZ = 3/* Round Toward Zero */
>> +};
>>
>> I have a question about how to avoid the magic constants before closing
>> v2 of this patch. One approach would be to keep the same code structure
>> and take advantage of the encoding of the rounding field. This way we
>> just use a formula and expect the C compiler's optimizer to work out that
>> the immediate value is a constant.
>

Re: [Mesa-dev] [PATCH 20/47] i965/fs: Define new shader opcodes to set rounding modes

2017-08-29 Thread Chema Casanova
On 25/08/17 at 20:09, Francisco Jerez wrote:
> Alejandro Piñeiro writes:
>
>> Although it is possible to emit them directly as AND/OR on brw_fs_nir,
>> having specific opcodes makes it easier to remove duplicate settings
>> later.
>>
>> Signed-off-by:  Alejandro Piñeiro 
>> Signed-off-by:  Jose Maria Casanova Crespo 
>> ---
>>  src/intel/compiler/brw_eu.h |  3 +++
>>  src/intel/compiler/brw_eu_defines.h |  9 +
>>  src/intel/compiler/brw_eu_emit.c| 19 +++
>>  src/intel/compiler/brw_fs_generator.cpp |  8 
>>  src/intel/compiler/brw_shader.cpp   |  5 +
>>  5 files changed, 44 insertions(+)
>>
>> diff --git a/src/intel/compiler/brw_eu.h b/src/intel/compiler/brw_eu.h
>> index a3a9c63239d..0a7f8020398 100644
>> --- a/src/intel/compiler/brw_eu.h
>> +++ b/src/intel/compiler/brw_eu.h
>> @@ -500,6 +500,9 @@ brw_broadcast(struct brw_codegen *p,
>>struct brw_reg src,
>>struct brw_reg idx);
>>  
>> +void
>> +brw_rounding_mode(struct brw_codegen *p,
>> +  enum brw_rnd_mode mode);
>>  /***
>>   * brw_eu_util.c:
>>   */
>> diff --git a/src/intel/compiler/brw_eu_defines.h 
>> b/src/intel/compiler/brw_eu_defines.h
>> index 1af835d47ed..50435df2fcf 100644
>> --- a/src/intel/compiler/brw_eu_defines.h
>> +++ b/src/intel/compiler/brw_eu_defines.h
>> @@ -388,6 +388,9 @@ enum opcode {
>> SHADER_OPCODE_TYPED_SURFACE_WRITE,
>> SHADER_OPCODE_TYPED_SURFACE_WRITE_LOGICAL,
>>  
>> +   SHADER_OPCODE_RND_MODE_RTE,
>> +   SHADER_OPCODE_RND_MODE_RTZ,
>> +
> We don't need an opcode for each possible rounding mode (there's also RU
> and RD).  How about you add a single SHADER_OPCODE_RND_MODE opcode
> taking an immediate with the right rounding mode?
I like the proposal. It is better to have a single opcode for setting the
rounding mode. Changed for v2 of this patch.
> Also, you should be marking the rounding mode opcodes as
> has_side_effects(), because otherwise you're giving the scheduler the
> freedom of moving your rounding mode update instruction past the
> instruction you wanted it to have an effect on...
Good point. We had already noticed that it was missing while debugging
the latency problem of the control register: the scheduler was moving the
cr0 modification to the beginning of the shader, which hid the problem.
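
To illustrate the has_side_effects() point, here is a minimal sketch, not
the Mesa code. All toy_* names are made up for this example, and a real
scheduler would also track register dependencies. It only shows why an
opcode that writes cr0 has to be flagged: its effect on a later conversion
is not visible in the operands, so nothing else stops a reorder.

```c
#include <stdbool.h>

/* Hypothetical opcode list for illustration only; not Mesa's enum opcode. */
enum toy_opcode {
   TOY_OPCODE_ADD,
   TOY_OPCODE_MUL,
   TOY_OPCODE_CONVERT_F32_TO_F16,
   TOY_OPCODE_RND_MODE,        /* writes the rounding-mode field of cr0 */
   TOY_OPCODE_MEMORY_FENCE
};

/* True for opcodes whose effect is not expressed through their register
 * operands, so operand dependencies alone cannot keep them in order. */
bool
toy_has_side_effects(enum toy_opcode op)
{
   switch (op) {
   case TOY_OPCODE_RND_MODE:
   case TOY_OPCODE_MEMORY_FENCE:
      return true;
   default:
      return false;
   }
}

/* Toy rule: two adjacent instructions may be swapped only if neither has
 * side effects. Register dependency checks are omitted on purpose. */
bool
toy_can_swap(enum toy_opcode first, enum toy_opcode second)
{
   return !toy_has_side_effects(first) && !toy_has_side_effects(second);
}
```

With this rule, toy_can_swap(TOY_OPCODE_RND_MODE,
TOY_OPCODE_CONVERT_F32_TO_F16) is false, so the conversion cannot be moved
ahead of the rounding mode update it depends on.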
>> SHADER_OPCODE_MEMORY_FENCE,
>>  
>> SHADER_OPCODE_GEN4_SCRATCH_READ,
>> @@ -1233,4 +1236,10 @@ enum brw_message_target {
>>  /* R0 */
>>  # define GEN7_GS_PAYLOAD_INSTANCE_ID_SHIFT  27
>>  
>> +enum PACKED brw_rnd_mode {
>> +   BRW_RND_MODE_UNSPECIFIED,
>> +   BRW_RND_MODE_RTE,
>> +   BRW_RND_MODE_RTZ,
> Since you're introducing a back-end-specific rounding mode enum already,
> why not use the hardware values right away so you avoid hard-coding
> magic constants below.
In the end we removed MODE_UNSPECIFIED, as it isn't really needed, and we
included Round Up and Round Down in the enum, assigning the values from
the PRM. We also renamed RTE to RTNE for consistency with the nir
conversion modifiers, which use the same acronym as the PRM.

+enum PACKED brw_rnd_mode {
+   BRW_RND_MODE_RTNE = 0,  /* Round to Nearest or Even */
+   BRW_RND_MODE_RU = 1,    /* Round Up, toward +inf */
+   BRW_RND_MODE_RD = 2,    /* Round Down, toward -inf */
+   BRW_RND_MODE_RTZ = 3    /* Round Toward Zero */
+};
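
As a rough sketch of why the hardware values help: with them, the field
contents are just the enum value shifted into place. The shift of 4 and
the two-bit mask are assumptions taken from the 0x0030u constant and the
"<< 4" used in this thread, and the toy_* names exist only in this example.

```c
#include <stdint.h>

/* Same values as the enum above, under illustrative names. */
enum toy_rnd_mode {
   TOY_RND_MODE_RTNE = 0,
   TOY_RND_MODE_RU   = 1,
   TOY_RND_MODE_RD   = 2,
   TOY_RND_MODE_RTZ  = 3
};

/* Assumed field position, inferred from the 0x0030u constant. */
#define TOY_CR0_RND_MODE_SHIFT 4
#define TOY_CR0_RND_MODE_MASK  (0x3u << TOY_CR0_RND_MODE_SHIFT)

/* Encode a rounding mode as the bits it occupies in the register. */
uint32_t
toy_rnd_mode_bits(enum toy_rnd_mode mode)
{
   return (uint32_t) mode << TOY_CR0_RND_MODE_SHIFT;
}

/* Decode the rounding mode back out of a register value. */
enum toy_rnd_mode
toy_rnd_mode_from_cr0(uint32_t cr0)
{
   return (enum toy_rnd_mode)
      ((cr0 & TOY_CR0_RND_MODE_MASK) >> TOY_CR0_RND_MODE_SHIFT);
}
```

No per-mode constant is needed: RTZ encodes to the whole mask and RTNE to
zero.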

I have a question about how to avoid the magic constants before closing
v2 of this patch. One approach would be to keep the same code structure
and take advantage of the encoding of the rounding field. This way we
just use a formula and expect the C compiler's optimizer to work out that
the immediate value is a constant.

    switch (mode) {
    case BRW_RND_MODE_RTZ:
-      inst = brw_OR(p, brw_cr0_reg(0), brw_cr0_reg(0),
-                    brw_imm_ud(0x0030u));
+      inst = brw_OR(p, brw_cr0_reg(0), brw_cr0_reg(0),
+                    brw_imm_ud(((unsigned int) mode << 4)));
       break;
-   case BRW_RND_MODE_RTE:
-      inst = brw_AND(p, brw_cr0_reg(0), brw_cr0_reg(0),
-                     brw_imm_ud(0xffcfu));
+   case BRW_RND_MODE_RTNE:
+      inst = brw_AND(p, brw_cr0_reg(0), brw_cr0_reg(0),
+                     brw_imm_ud(((unsigned int) mode << 4) | ~0x0030u));
       break;
    default:

Another approach would be to implement a general solution, using bitwise
operations, for all rounding modes supported by the hw, including Round Up
and Round Down.
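
The idea can be sketched on a plain integer standing in for cr0. This is
only a model under stated assumptions: the real function would emit AND/OR
instructions on the control register, toy_set_rounding_mode is a made-up
name, and the "& 0x3u" guard on the mode is an addition for this example.

```c
#include <stdint.h>

/* Assumed field position, matching the 0x30 mask and "<< 4" shift. */
#define TOY_RND_MODE_SHIFT 4
#define TOY_RND_MODE_MASK  (0x3u << TOY_RND_MODE_SHIFT)

/* Returns the register value with its rounding-mode field set to mode
 * (0 = RTNE, 1 = RU, 2 = RD, 3 = RTZ); all other bits are preserved. */
uint32_t
toy_set_rounding_mode(uint32_t cr0, unsigned mode)
{
   const uint32_t bits = (uint32_t) (mode & 0x3u) << TOY_RND_MODE_SHIFT;

   /* Clear the field first, unless both bits are about to be set anyway. */
   if (bits != TOY_RND_MODE_MASK)
      cr0 &= ~TOY_RND_MODE_MASK;

   /* Set the requested bits; for mode 0 there is nothing to set. */
   if (bits)
      cr0 |= bits;

   return cr0;
}
```

RTNE and RTZ each need a single operation in this scheme, while RU and RD
need both the clear and the set.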

/**
 * Changes the floating point rounding mode updating the control register
 * field defined at cr0.0[5-6] bits. This function supports the changes to
 * RTNE (00), RU (01), RD (10) and RTZ (11) rounding using bitwise
 * operations.
 * Only RTNE and RTZ rounding are enabled at nir.
 */

void
brw_rounding_mode(struct brw_codegen *p,
  enum brw_rnd_mode mode)
{

   const unsigned int mask = 0x30u;
   const unsigned int enable_bits =  ((unsigned int) mode) << 4;
   const unsigned int 

Re: [Mesa-dev] [PATCH 12/47] i965/fs: Add brw_reg_type_from_bit_size utility method

2017-08-28 Thread Chema Casanova
On 26/08/17 at 19:19, Jason Ekstrand wrote:
> On Thu, Aug 24, 2017 at 6:54 AM, Alejandro Piñeiro wrote:
>
> Returns the brw_type for a given ssa.bit_size, and a reference type.
> So if bit_size is 64, and the reference type is BRW_REGISTER_TYPE_F,
> it returns BRW_REGISTER_TYPE_DF. The same applies if bit_size is 32
> and reference type is BRW_REGISTER_TYPE_HF it returns
> BRW_REGISTER_TYPE_F
>
> Signed-off-by: Jose Maria Casanova Crespo
> Signed-off-by: Alejandro Piñeiro
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 67
> ---
>  1 file changed, 62 insertions(+), 5 deletions(-)
>
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index d760946e624..e4eba1401f8 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -214,6 +214,63 @@ fs_visitor::nir_emit_system_values()
>     }
>  }
>
> +/*
> + * Returns a type based on a reference_type (word, float, half-float) and a
> + * given bit_size.
> + *
> + * Reference BRW_REGISTER_TYPE are HF,F,DF,W,D,UW,UD.
> + *
> + * @FIXME: 64-bit return types are always DF on integer types to maintain
> + * compatibility with uses of DF previously to the introduction of int64
> + * support.
> + */
> +static brw_reg_type
> +brw_reg_type_from_bit_size(const unsigned bit_size,
> +                           const brw_reg_type reference_type)
> +{
> +   switch(reference_type) {
> +   case BRW_REGISTER_TYPE_HF:
> +   case BRW_REGISTER_TYPE_F:
> +   case BRW_REGISTER_TYPE_DF:
> +      switch(bit_size) {
> +      case 16:
> +         return BRW_REGISTER_TYPE_HF;
> +      case 32:
> +         return BRW_REGISTER_TYPE_F;
> +      case 64:
> +         return BRW_REGISTER_TYPE_DF;
> +      default:
> +         unreachable("Not reached");
>
>
> Please add something more descriptive here such as "Invalid bit size"
OK.
>  
>
> +      }
> +   case BRW_REGISTER_TYPE_W:
> +   case BRW_REGISTER_TYPE_D:
>
>
> Please add the Q type
OK.
>  
>
> +      switch(bit_size) {
> +      case 16:
> +         return BRW_REGISTER_TYPE_W;
> +      case 32:
> +         return BRW_REGISTER_TYPE_D;
> +      case 64:
> +         return BRW_REGISTER_TYPE_DF;
> +      default:
> +         unreachable("Not reached");
>
>
> Better message
OK.
>  
>
> +      }
> +   case BRW_REGISTER_TYPE_UW:
> +   case BRW_REGISTER_TYPE_UD:
>
>
> Please add the UQ type
OK.
>  
>
> +      switch(bit_size) {
> +      case 16:
> +         return BRW_REGISTER_TYPE_UW;
> +      case 32:
> +         return BRW_REGISTER_TYPE_UD;
> +      case 64:
> +         return BRW_REGISTER_TYPE_DF;
> +      default:
> +         unreachable("Not reached");
>
>
> better message
>  
>
> +      }
> +   default:
> +      unreachable("Not reached");
>
>
> better message
>
> I've got all those fixes in a version of this patch I pulled into my
> subgroups tree.
>  
So in the end I picked up your reviewed version of this patch locally from:
https://cgit.freedesktop.org/~jekstrand/mesa/commit/?id=c21aee439ffc15a7b8cec811727c0efb5e2bfa6c

I couldn't find your subgroups tree.
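
For reference, a self-contained sketch of how the mapping might look with
the requested changes applied. It is not the contents of the commit linked
above: toy_reg_type is a stand-in for the BRW_REGISTER_TYPE_* values,
assert() stands in for unreachable(), and 64-bit integer sizes still
return DF, following the FIXME in the quoted patch.

```c
#include <assert.h>

/* Stand-in for the BRW_REGISTER_TYPE_* values; illustration only. */
enum toy_reg_type {
   TOY_HF, TOY_F, TOY_DF,
   TOY_W, TOY_D, TOY_Q,
   TOY_UW, TOY_UD, TOY_UQ,
   TOY_INVALID
};

enum toy_reg_type
toy_reg_type_from_bit_size(unsigned bit_size,
                           enum toy_reg_type reference_type)
{
   switch (reference_type) {
   case TOY_HF:
   case TOY_F:
   case TOY_DF:
      switch (bit_size) {
      case 16: return TOY_HF;
      case 32: return TOY_F;
      case 64: return TOY_DF;
      default: assert(!"Invalid bit size"); return TOY_INVALID;
      }
   case TOY_W:
   case TOY_D:
   case TOY_Q:                    /* Q accepted as a reference type */
      switch (bit_size) {
      case 16: return TOY_W;
      case 32: return TOY_D;
      case 64: return TOY_DF;     /* kept as DF, per the FIXME */
      default: assert(!"Invalid bit size"); return TOY_INVALID;
      }
   case TOY_UW:
   case TOY_UD:
   case TOY_UQ:                   /* UQ accepted as a reference type */
      switch (bit_size) {
      case 16: return TOY_UW;
      case 32: return TOY_UD;
      case 64: return TOY_DF;     /* kept as DF, per the FIXME */
      default: assert(!"Invalid bit size"); return TOY_INVALID;
      }
   default:
      assert(!"Unknown reference type");
      return TOY_INVALID;
   }
}
```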

Thanks for the review.

Chema

