Re: [Mesa-dev] [PATCH] intel/fs: Don't apply the des stride alignment rule to accumulators

2019-01-17 Thread Jason Ekstrand
On Thu, Jan 17, 2019 at 5:11 PM Francisco Jerez 
wrote:

> Subject is still inaccurate.  How about "intel/fs: Don't touch
> accumulator destination while applying regioning alignment rule."
>

Sounds good to me.  I'll fix it.


>
> Jason Ekstrand  writes:
>
> > In some shaders, you can end up with a stride in the source of a
> > SHADER_OPCODE_MULH.  One way this can happen is if the MULH is acting on
> > the top bits of a 64-bit value due to 64-bit integer lowering.  In this
> > case, the compiler will produce something like this:
> >
> > mul(8)acc0<1>UD   g5<8,4,2>UD   0x0004UW  { align1 1Q };
> > mach(8)   g6<1>UD g5<8,4,2>UD   0x0004UD  { align1 1Q
> AccWrEnable };
> >
> > The new region fixup pass looks at the MUL and sees a strided source and
> > unstrided destination and determines that the sequence is illegal.  It
> > then attempts to fix the illegal stride by replacing the destination of
> > the MUL with a temporary and emitting a MOV into the accumulator:
> >
> > mul(8)g9<2>UD g5<8,4,2>UD   0x0004UW  { align1 1Q };
> > mov(8)acc0<1>UD   g9<8,4,2>UD { align1 1Q };
> > mach(8)   g6<1>UD g5<8,4,2>UD   0x0004UD  { align1 1Q
> AccWrEnable };
> >
> > Unfortunately, this new sequence isn't correct because MOV accesses the
> > accumulator with a different precision to MUL and, instead of filling
> > the bottom 32 bits with the source and zeroing the top 32 bits, it
> > leaves the top 32 (or maybe 31) bits alone and full of garbage.  When
> > the MACH comes along and tries to complete the multiplication, the
> > result is correct in the bottom 32 bits (which we throw away) and
> > garbage in the top 32 bits which are actually returned by MACH.
> >
> > This commit does two things:  First, it adds an assert to ensure that we
> > don't try to rewrite accumulator destinations of MUL instructions so we
> > can avoid this precision issue.  Second, it modifies
> > required_dst_byte_stride to require a tightly packed stride so that we
> > fix up the sources instead and the actual code which gets emitted is
> > this:
> >
> > mov(8)g9<1>UD g5<8,4,2>UD { align1 1Q };
> > mul(8)acc0<1>UD   g9<8,8,1>UD   0x0004UW  { align1 1Q };
> > mach(8)   g6<1>UD g5<8,4,2>UD   0x0004UD  { align1 1Q
> AccWrEnable };
> >
> > Fixes: efa4e4bc5fc "intel/fs: Introduce regioning lowering pass"
> > Cc: Francisco Jerez 
> > ---
> >  src/intel/compiler/brw_fs_lower_regioning.cpp | 24 ++-
> >  1 file changed, 23 insertions(+), 1 deletion(-)
> >
> > diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp
> b/src/intel/compiler/brw_fs_lower_regioning.cpp
> > index cc4163b4c2c..00cb5769ebe 100644
> > --- a/src/intel/compiler/brw_fs_lower_regioning.cpp
> > +++ b/src/intel/compiler/brw_fs_lower_regioning.cpp
> > @@ -53,7 +53,21 @@ namespace {
> > unsigned
> > required_dst_byte_stride(const fs_inst *inst)
> > {
> > -  if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&
> > +  if (inst->dst.is_accumulator()) {
> > + /* If the destination is an accumulator, insist that we leave
> the
> > +  * stride alone.  We cannot "fix" accumulator destinations by
> writing
> > +  * to a temporary and emitting a MOV into the original
> destination.
> > +  * For multiply instructions (our one use of the accumulator),
> the
> > +  * MUL writes the full 66 bits of the accumulator whereas the
> MOV we
> > +  * would emit only writes 33 bits ane leaves the top 33 bits
>
> and*
>
> > +  * undefined.
> > +  *
> > +  * It's safe to just require the original stride here because
> the
> > +  * lowering pass will detect the mismatch in
> required_src_byte_stride
> > +  * just fix up the sources of the multiply instead of the
> destination.
>
> There isn't such a thing as "required_src_byte_stride".  Conjunction
> missing between that sentence and the "just fix up..." one.
>
> Code is still:
>
> Reviewed-by: Francisco Jerez 
>

Thanks!  I'll do one more Jenkins run and push it.

--Jason


> > +  */
> > + return inst->dst.stride * type_sz(inst->dst.type);
> > +  } else if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&
> >!is_byte_raw_mov(inst)) {
> >   return get_exec_type_size(inst);
> >} else {
> > @@ -316,6 +330,14 @@ namespace {
> > bool
> > lower_dst_region(fs_visitor *v, bblock_t *block, fs_inst *inst)
> > {
> > +  /* We cannot replace the result of an integer multiply which
> writes the
> > +   * accumulator because MUL+MACH pairs act on the accumulator as a
> 66-bit
> > +   * value whereas the MOV will act on only 32 or 33 bits of the
> > +   * accumulator.
> > +   */
> > +  assert(inst->opcode != BRW_OPCODE_MUL ||
> !inst->dst.is_accumulator() ||
> > + brw_reg_type_is_floating_point(inst->dst.type));
> > +
> >const fs_builder 

Re: [Mesa-dev] [PATCH] intel/fs: Don't apply the des stride alignment rule to accumulators

2019-01-17 Thread Francisco Jerez
Jason Ekstrand  writes:

> On Thu, Jan 17, 2019 at 3:34 PM Francisco Jerez 
> wrote:
>
>> Jason Ekstrand  writes:
>>
>> > Bah... previous e-mail unfinished.  Please ignore.
>> >
>> > On Thu, Jan 17, 2019 at 4:15 AM Francisco Jerez 
>> > wrote:
>> >
>> >> Jason Ekstrand  writes:
>> >>
>> >> > The pass was discovered to cause problems with the MUL+MACH
>> combination
>> >> > we emit for nir_[iu]mul_high.  In an experimental branch of mine, I
>> ran
>> >> > into issues where the MUL+MACH ended up using a strided source due to
>> >> > working on half of a uint64_t and the new lowering pass helpfully
>> tried
>> >> > to fix the multiply which wrote to an unstriated accumulator.
>> >>
>> >> > Not only did the multiply not need to be fixed
>> >>
>> >> That's far from clear, and inconsistent with what this patch is doing,
>> >> since the fix is still being applied (Wouldn't it make sense to clarify
>> >> that in the commit message since it's slightly misleading about it?).
>> >>
>> >> The original instruction was technically violating the first CHV/BXT
>> >> double-precision regioning restriction before the pass was introduced,
>> >> that's why it made any changes in the first place.  The integer
>> >> multiplication lowering code was just lucky enough that violating the
>> >> restriction didn't matter in this case, but I doubt that the reason for
>> >> that had anything to do with the accumulator being the explicit
>> >> destination...
>> >>
>> >
>> > Explicit, no, but I do suspect that does have to do with it being the
>> > accumulator.  This restriction isn't theoretical; if you violate it
>> > with a GRF, you will get data corruption; I've seen it myself.
>>
>> The BSpec language is vague and frequently inconsistent.  Obviously it
>> was being violated before because the text doesn't name the accumulator
>> as an exemption from that rule.  The fact that you've seen it blow up
>> with corruption before doesn't guarantee it will always blow up under
>> the conditions stated on the hardware spec (because those conditions are
>> a highly imperfect abstraction of the hardware logic rather than the
>> hardware logic itself).  It's because the restriction (as it's
>> enunciated in the BSpec) was purely theoretical that the MULH
>> implementation worked in the first place.
>>
>> > I could see two possible explanations:
>> >
>> >  1. Under the hood the accumulator is written with a Q type and an
>> internal
>> > stride of 8 bytes, hence the restriction does apply but is implicitly
>> > satisfied for D type source strides of 1 and 2.
>> >  2. The data path to the accumulator is a special case in the hardware
>> and
>> > doesn't use the normal general-purpose regioning logic and so doesn't
>> > require the restriction.
>> >
>>
>> I don't see any evidence for any of these explanations.  I believe that
>> the actual reason why the MULH implementation didn't suffer the effects
>> of violating these restrictions is that in fact they don't apply to
>> *any* 32x16-bit integer multiply operations at all despite what the
>> hardware spec says, whether the destination is the accumulator or not.
>>
>> I've verified it by doing a daily CI run on the following patch:
>>
>>
>> https://cgit.freedesktop.org/~currojerez/mesa/commit/?h=jenkins=c1a32c4e1e53d70b0c8f6254f0f53f0230b7e21b
>>
>> It disables legalization of the integer multiply instruction and then
>> adds a hack to lower_integer_multiplication() for it to intentionally
>> break the alignment rule.  No regressions on CHV/BXT/GLK.  My reading of
>> the simulator confirms that 32x16-bit multiplication isn't affected by
>> the restriction.
>>
>
> That explanation makes sense especially when combined with the fact that
> DxD -> Q and DxD -> D was added on gen8 and DxW -> D or DxW -> acc0 has
> been around for a long time.
>
>
>> I'm tempted to send a patch that disables regioning alignment lowering
>> for 32x16-bit integer multiplication strictly for performance.  But
>> that's really an orthogonal change to this patch, since due to the issue
>> of precision loss we still need to make sure not to touch accumulator
>> destinations in instructions that *do* have this restriction.
>>
>
> I just searched through the fs code and implementing MULH is the only use
> of the accumulator in the scalar back-end.  Given that it explicitly only
> does a DxW multiply, changing the lowering pass to only apply to DxD
> multiplies would guarantee correct use of the accumulator.  If we think
> this is the real reason, then I'd be a fan of such an approach.
>

"Guarantee correct" sounds somewhat too strong to me.  Yes, it would
prevent the failure you've seen for the time being and until something
in the compiler changes and introduces additional uses of the
accumulator.  Because nothing in the IR guarantees that the accumulator
won't ever be used in combination with a restricted instruction I'd
argue that we want both patches.

> As a side-note, we really should add a NIR opcode for DxD -> Q 

Re: [Mesa-dev] [PATCH] intel/fs: Don't apply the des stride alignment rule to accumulators

2019-01-17 Thread Francisco Jerez
Subject is still inaccurate.  How about "intel/fs: Don't touch
accumulator destination while applying regioning alignment rule."

Jason Ekstrand  writes:

> In some shaders, you can end up with a stride in the source of a
> SHADER_OPCODE_MULH.  One way this can happen is if the MULH is acting on
> the top bits of a 64-bit value due to 64-bit integer lowering.  In this
> case, the compiler will produce something like this:
>
> mul(8)acc0<1>UD   g5<8,4,2>UD   0x0004UW  { align1 1Q };
> mach(8)   g6<1>UD g5<8,4,2>UD   0x0004UD  { align1 1Q AccWrEnable };
>
> The new region fixup pass looks at the MUL and sees a strided source and
> unstrided destination and determines that the sequence is illegal.  It
> then attempts to fix the illegal stride by replacing the destination of
> the MUL with a temporary and emitting a MOV into the accumulator:
>
> mul(8)g9<2>UD g5<8,4,2>UD   0x0004UW  { align1 1Q };
> mov(8)acc0<1>UD   g9<8,4,2>UD { align1 1Q };
> mach(8)   g6<1>UD g5<8,4,2>UD   0x0004UD  { align1 1Q AccWrEnable };
>
> Unfortunately, this new sequence isn't correct because MOV accesses the
> accumulator with a different precision to MUL and, instead of filling
> the bottom 32 bits with the source and zeroing the top 32 bits, it
> leaves the top 32 (or maybe 31) bits alone and full of garbage.  When
> the MACH comes along and tries to complete the multiplication, the
> result is correct in the bottom 32 bits (which we throw away) and
> garbage in the top 32 bits which are actually returned by MACH.
>
> This commit does two things:  First, it adds an assert to ensure that we
> don't try to rewrite accumulator destinations of MUL instructions so we
> can avoid this precision issue.  Second, it modifies
> required_dst_byte_stride to require a tightly packed stride so that we
> fix up the sources instead and the actual code which gets emitted is
> this:
>
> mov(8)g9<1>UD g5<8,4,2>UD { align1 1Q };
> mul(8)acc0<1>UD   g9<8,8,1>UD   0x0004UW  { align1 1Q };
> mach(8)   g6<1>UD g5<8,4,2>UD   0x0004UD  { align1 1Q AccWrEnable };
>
> Fixes: efa4e4bc5fc "intel/fs: Introduce regioning lowering pass"
> Cc: Francisco Jerez 
> ---
>  src/intel/compiler/brw_fs_lower_regioning.cpp | 24 ++-
>  1 file changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp 
> b/src/intel/compiler/brw_fs_lower_regioning.cpp
> index cc4163b4c2c..00cb5769ebe 100644
> --- a/src/intel/compiler/brw_fs_lower_regioning.cpp
> +++ b/src/intel/compiler/brw_fs_lower_regioning.cpp
> @@ -53,7 +53,21 @@ namespace {
> unsigned
> required_dst_byte_stride(const fs_inst *inst)
> {
> -  if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&
> +  if (inst->dst.is_accumulator()) {
> + /* If the destination is an accumulator, insist that we leave the
> +  * stride alone.  We cannot "fix" accumulator destinations by 
> writing
> +  * to a temporary and emitting a MOV into the original destination.
> +  * For multiply instructions (our one use of the accumulator), the
> +  * MUL writes the full 66 bits of the accumulator whereas the MOV we
> +  * would emit only writes 33 bits ane leaves the top 33 bits

and*

> +  * undefined.
> +  *
> +  * It's safe to just require the original stride here because the
> +  * lowering pass will detect the mismatch in 
> required_src_byte_stride
> +  * just fix up the sources of the multiply instead of the 
> destination.

There isn't such a thing as "required_src_byte_stride".  Conjunction
missing between that sentence and the "just fix up..." one.

Code is still:

Reviewed-by: Francisco Jerez 

> +  */
> + return inst->dst.stride * type_sz(inst->dst.type);
> +  } else if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&
>!is_byte_raw_mov(inst)) {
>   return get_exec_type_size(inst);
>} else {
> @@ -316,6 +330,14 @@ namespace {
> bool
> lower_dst_region(fs_visitor *v, bblock_t *block, fs_inst *inst)
> {
> +  /* We cannot replace the result of an integer multiply which writes the
> +   * accumulator because MUL+MACH pairs act on the accumulator as a 
> 66-bit
> +   * value whereas the MOV will act on only 32 or 33 bits of the
> +   * accumulator.
> +   */
> +  assert(inst->opcode != BRW_OPCODE_MUL || !inst->dst.is_accumulator() ||
> + brw_reg_type_is_floating_point(inst->dst.type));
> +
>const fs_builder ibld(v, block, inst);
>const unsigned stride = required_dst_byte_stride(inst) /
>type_sz(inst->dst.type);
> -- 
> 2.20.1


signature.asc
Description: PGP signature
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


[Mesa-dev] [PATCH] intel/fs: Don't apply the des stride alignment rule to accumulators

2019-01-17 Thread Jason Ekstrand
In some shaders, you can end up with a stride in the source of a
SHADER_OPCODE_MULH.  One way this can happen is if the MULH is acting on
the top bits of a 64-bit value due to 64-bit integer lowering.  In this
case, the compiler will produce something like this:

mul(8)acc0<1>UD   g5<8,4,2>UD   0x0004UW  { align1 1Q };
mach(8)   g6<1>UD g5<8,4,2>UD   0x0004UD  { align1 1Q AccWrEnable };

The new region fixup pass looks at the MUL and sees a strided source and
unstrided destination and determines that the sequence is illegal.  It
then attempts to fix the illegal stride by replacing the destination of
the MUL with a temporary and emitting a MOV into the accumulator:

mul(8)g9<2>UD g5<8,4,2>UD   0x0004UW  { align1 1Q };
mov(8)acc0<1>UD   g9<8,4,2>UD { align1 1Q };
mach(8)   g6<1>UD g5<8,4,2>UD   0x0004UD  { align1 1Q AccWrEnable };

Unfortunately, this new sequence isn't correct because MOV accesses the
accumulator with a different precision to MUL and, instead of filling
the bottom 32 bits with the source and zeroing the top 32 bits, it
leaves the top 32 (or maybe 31) bits alone and full of garbage.  When
the MACH comes along and tries to complete the multiplication, the
result is correct in the bottom 32 bits (which we throw away) and
garbage in the top 32 bits which are actually returned by MACH.

This commit does two things:  First, it adds an assert to ensure that we
don't try to rewrite accumulator destinations of MUL instructions so we
can avoid this precision issue.  Second, it modifies
required_dst_byte_stride to require a tightly packed stride so that we
fix up the sources instead and the actual code which gets emitted is
this:

mov(8)g9<1>UD g5<8,4,2>UD { align1 1Q };
mul(8)acc0<1>UD   g9<8,8,1>UD   0x0004UW  { align1 1Q };
mach(8)   g6<1>UD g5<8,4,2>UD   0x0004UD  { align1 1Q AccWrEnable };

Fixes: efa4e4bc5fc "intel/fs: Introduce regioning lowering pass"
Cc: Francisco Jerez 
---
 src/intel/compiler/brw_fs_lower_regioning.cpp | 24 ++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp 
b/src/intel/compiler/brw_fs_lower_regioning.cpp
index cc4163b4c2c..00cb5769ebe 100644
--- a/src/intel/compiler/brw_fs_lower_regioning.cpp
+++ b/src/intel/compiler/brw_fs_lower_regioning.cpp
@@ -53,7 +53,21 @@ namespace {
unsigned
required_dst_byte_stride(const fs_inst *inst)
{
-  if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&
+  if (inst->dst.is_accumulator()) {
+ /* If the destination is an accumulator, insist that we leave the
+  * stride alone.  We cannot "fix" accumulator destinations by writing
+  * to a temporary and emitting a MOV into the original destination.
+  * For multiply instructions (our one use of the accumulator), the
+  * MUL writes the full 66 bits of the accumulator whereas the MOV we
+  * would emit only writes 33 bits ane leaves the top 33 bits
+  * undefined.
+  *
+  * It's safe to just require the original stride here because the
+  * lowering pass will detect the mismatch in required_src_byte_stride
+  * just fix up the sources of the multiply instead of the destination.
+  */
+ return inst->dst.stride * type_sz(inst->dst.type);
+  } else if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&
   !is_byte_raw_mov(inst)) {
  return get_exec_type_size(inst);
   } else {
@@ -316,6 +330,14 @@ namespace {
bool
lower_dst_region(fs_visitor *v, bblock_t *block, fs_inst *inst)
{
+  /* We cannot replace the result of an integer multiply which writes the
+   * accumulator because MUL+MACH pairs act on the accumulator as a 66-bit
+   * value whereas the MOV will act on only 32 or 33 bits of the
+   * accumulator.
+   */
+  assert(inst->opcode != BRW_OPCODE_MUL || !inst->dst.is_accumulator() ||
+ brw_reg_type_is_floating_point(inst->dst.type));
+
   const fs_builder ibld(v, block, inst);
   const unsigned stride = required_dst_byte_stride(inst) /
   type_sz(inst->dst.type);
-- 
2.20.1

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] intel/fs: Don't apply the des stride alignment rule to accumulators

2019-01-17 Thread Jason Ekstrand
On Thu, Jan 17, 2019 at 3:34 PM Francisco Jerez 
wrote:

> Jason Ekstrand  writes:
>
> > Bah... previous e-mail unfinished.  Please ignore.
> >
> > On Thu, Jan 17, 2019 at 4:15 AM Francisco Jerez 
> > wrote:
> >
> >> Jason Ekstrand  writes:
> >>
> >> > The pass was discovered to cause problems with the MUL+MACH
> combination
> >> > we emit for nir_[iu]mul_high.  In an experimental branch of mine, I
> ran
> >> > into issues where the MUL+MACH ended up using a strided source due to
> >> > working on half of a uint64_t and the new lowering pass helpfully
> tried
> >> > to fix the multiply which wrote to an unstriated accumulator.
> >>
> >> > Not only did the multiply not need to be fixed
> >>
> >> That's far from clear, and inconsistent with what this patch is doing,
> >> since the fix is still being applied (Wouldn't it make sense to clarify
> >> that in the commit message since it's slightly misleading about it?).
> >>
> >> The original instruction was technically violating the first CHV/BXT
> >> double-precision regioning restriction before the pass was introduced,
> >> that's why it made any changes in the first place.  The integer
> >> multiplication lowering code was just lucky enough that violating the
> >> restriction didn't matter in this case, but I doubt that the reason for
> >> that had anything to do with the accumulator being the explicit
> >> destination...
> >>
> >
> > Explicit, no, but I do suspect that does have to do with it being the
> > accumulator.  This restriction isn't theoretical; if you violate it
> > with a GRF, you will get data corruption; I've seen it myself.
>
> The BSpec language is vague and frequently inconsistent.  Obviously it
> was being violated before because the text doesn't name the accumulator
> as an exemption from that rule.  The fact that you've seen it blow up
> with corruption before doesn't guarantee it will always blow up under
> the conditions stated on the hardware spec (because those conditions are
> a highly imperfect abstraction of the hardware logic rather than the
> hardware logic itself).  It's because the restriction (as it's
> enunciated in the BSpec) was purely theoretical that the MULH
> implementation worked in the first place.
>
> > I could see two possible explanations:
> >
> >  1. Under the hood the accumulator is written with a Q type and an
> internal
> > stride of 8 bytes, hence the restriction does apply but is implicitly
> > satisfied for D type source strides of 1 and 2.
> >  2. The data path to the accumulator is a special case in the hardware
> and
> > doesn't use the normal general-purpose regioning logic and so doesn't
> > require the restriction.
> >
>
> I don't see any evidence for any of these explanations.  I believe that
> the actual reason why the MULH implementation didn't suffer the effects
> of violating these restrictions is that in fact they don't apply to
> *any* 32x16-bit integer multiply operations at all despite what the
> hardware spec says, whether the destination is the accumulator or not.
>
> I've verified it by doing a daily CI run on the following patch:
>
>
> https://cgit.freedesktop.org/~currojerez/mesa/commit/?h=jenkins=c1a32c4e1e53d70b0c8f6254f0f53f0230b7e21b
>
> It disables legalization of the integer multiply instruction and then
> adds a hack to lower_integer_multiplication() for it to intentionally
> break the alignment rule.  No regressions on CHV/BXT/GLK.  My reading of
> the simulator confirms that 32x16-bit multiplication isn't affected by
> the restriction.
>

That explanation makes sense especially when combined with the fact that
DxD -> Q and DxD -> D was added on gen8 and DxW -> D or DxW -> acc0 has
been around for a long time.


> I'm tempted to send a patch that disables regioning alignment lowering
> for 32x16-bit integer multiplication strictly for performance.  But
> that's really an orthogonal change to this patch, since due to the issue
> of precision loss we still need to make sure not to touch accumulator
> destinations in instructions that *do* have this restriction.
>

I just searched through the fs code and implementing MULH is the only use
of the accumulator in the scalar back-end.  Given that it explicitly only
does a DxW multiply, changing the lowering pass to only apply to DxD
multiplies would guarantee correct use of the accumulator.  If we think
this is the real reason, then I'd be a fan of such an approach.

As a side-note, we really should add a NIR opcode for DxD -> Q multiply and
use that when lowering 64-bit multiplicatoin.

> I don't find the first one very convincing at all.  Among other things, if
> > it were the true reason, it would imply that we would need to use a
> stride
> > of exactly 2 on D type sources which but empirical evidence suggests that
> > "mul(8) acc0<1> g5<8,8,1>UD g9<16,8,2>UW" works just fine.
> >
> >
> >> > but the "fix" ended up breaking it because a MOV to the accumulator is
> >> > not the same as using it as a multiply destination due to the 

Re: [Mesa-dev] [PATCH] intel/fs: Don't apply the des stride alignment rule to accumulators

2019-01-17 Thread Francisco Jerez
Jason Ekstrand  writes:

> Bah... previous e-mail unfinished.  Please ignore.
>
> On Thu, Jan 17, 2019 at 4:15 AM Francisco Jerez 
> wrote:
>
>> Jason Ekstrand  writes:
>>
>> > The pass was discovered to cause problems with the MUL+MACH combination
>> > we emit for nir_[iu]mul_high.  In an experimental branch of mine, I ran
>> > into issues where the MUL+MACH ended up using a strided source due to
>> > working on half of a uint64_t and the new lowering pass helpfully tried
>> > to fix the multiply which wrote to an unstriated accumulator.
>>
>> > Not only did the multiply not need to be fixed
>>
>> That's far from clear, and inconsistent with what this patch is doing,
>> since the fix is still being applied (Wouldn't it make sense to clarify
>> that in the commit message since it's slightly misleading about it?).
>>
>> The original instruction was technically violating the first CHV/BXT
>> double-precision regioning restriction before the pass was introduced,
>> that's why it made any changes in the first place.  The integer
>> multiplication lowering code was just lucky enough that violating the
>> restriction didn't matter in this case, but I doubt that the reason for
>> that had anything to do with the accumulator being the explicit
>> destination...
>>
>
> Explicit, no, but I do suspect that does have to do with it being the
> accumulator.  This restriction isn't theoretical; if you violate it
> with a GRF, you will get data corruption; I've seen it myself.

The BSpec language is vague and frequently inconsistent.  Obviously it
was being violated before because the text doesn't name the accumulator
as an exemption from that rule.  The fact that you've seen it blow up
with corruption before doesn't guarantee it will always blow up under
the conditions stated on the hardware spec (because those conditions are
a highly imperfect abstraction of the hardware logic rather than the
hardware logic itself).  It's because the restriction (as it's
enunciated in the BSpec) was purely theoretical that the MULH
implementation worked in the first place.

> I could see two possible explanations:
>
>  1. Under the hood the accumulator is written with a Q type and an internal
> stride of 8 bytes, hence the restriction does apply but is implicitly
> satisfied for D type source strides of 1 and 2.
>  2. The data path to the accumulator is a special case in the hardware and
> doesn't use the normal general-purpose regioning logic and so doesn't
> require the restriction.
>

I don't see any evidence for any of these explanations.  I believe that
the actual reason why the MULH implementation didn't suffer the effects
of violating these restrictions is that in fact they don't apply to
*any* 32x16-bit integer multiply operations at all despite what the
hardware spec says, whether the destination is the accumulator or not.

I've verified it by doing a daily CI run on the following patch:

https://cgit.freedesktop.org/~currojerez/mesa/commit/?h=jenkins=c1a32c4e1e53d70b0c8f6254f0f53f0230b7e21b

It disables legalization of the integer multiply instruction and then
adds a hack to lower_integer_multiplication() for it to intentionally
break the alignment rule.  No regressions on CHV/BXT/GLK.  My reading of
the simulator confirms that 32x16-bit multiplication isn't affected by
the restriction.

I'm tempted to send a patch that disables regioning alignment lowering
for 32x16-bit integer multiplication strictly for performance.  But
that's really an orthogonal change to this patch, since due to the issue
of precision loss we still need to make sure not to touch accumulator
destinations in instructions that *do* have this restriction.

> I don't find the first one very convincing at all.  Among other things, if
> it were the true reason, it would imply that we would need to use a stride
> of exactly 2 on D type sources which but empirical evidence suggests that
> "mul(8) acc0<1> g5<8,8,1>UD g9<16,8,2>UW" works just fine.
>
>
>> > but the "fix" ended up breaking it because a MOV to the accumulator is
>> > not the same as using it as a multiply destination due to the magic
>> > way the 33/64 bits of the
>>
>> Technically it has 66 bits (it wasn't a typo when I said that to you
>> earlier on IRC).  That's how it can t hold the result of a SIMD16
>> 16x16-bit integer multiplication with 33-bit signed precision per scalar
>> component.
>>
>
> Yes, there are 33 bits available for WxW multiplies but this is dealing
> with a DxD multiply which only has 64 bits according to this bit of bspec
> text:
>
> As there are only 64 bits per channel in DWord mode (D and UD), it is
> sufficient to store the multiplication result of two DWord operands as long
> as the post source modified sources are still within 32 bits. If any one
> source is type UD and is negated, the negated result becomes 33 bits. The
> DWord multiplication result is then 65 bits, bigger than the storage
> capacity of accumulators. Consequently, the results are unpredictable.
>
>


Re: [Mesa-dev] [PATCH] intel/fs: Don't apply the des stride alignment rule to accumulators

2019-01-17 Thread Jason Ekstrand
Bah... previous e-mail unfinished.  Please ignore.

On Thu, Jan 17, 2019 at 4:15 AM Francisco Jerez 
wrote:

> Jason Ekstrand  writes:
>
> > The pass was discovered to cause problems with the MUL+MACH combination
> > we emit for nir_[iu]mul_high.  In an experimental branch of mine, I ran
> > into issues where the MUL+MACH ended up using a strided source due to
> > working on half of a uint64_t and the new lowering pass helpfully tried
> > to fix the multiply which wrote to an unstriated accumulator.
>
> > Not only did the multiply not need to be fixed
>
> That's far from clear, and inconsistent with what this patch is doing,
> since the fix is still being applied (Wouldn't it make sense to clarify
> that in the commit message since it's slightly misleading about it?).
>
> The original instruction was technically violating the first CHV/BXT
> double-precision regioning restriction before the pass was introduced,
> that's why it made any changes in the first place.  The integer
> multiplication lowering code was just lucky enough that violating the
> restriction didn't matter in this case, but I doubt that the reason for
> that had anything to do with the accumulator being the explicit
> destination...
>

Explicit, no, but I do suspect that does have to do with it being the
accumulator.  This restriction isn't theoretical; if you violate it with a
GRF, you will get data corruption; I've seen it myself.  I could see two
possible explanations:

 1. Under the hood the accumulator is written with a Q type and an internal
stride of 8 bytes, hence the restriction does apply but is implicitly
satisfied for D type source strides of 1 and 2.
 2. The data path to the accumulator is a special case in the hardware and
doesn't use the normal general-purpose regioning logic and so doesn't
require the restriction.

I don't find the first one very convincing at all.  Among other things, if
it were the true reason, it would imply that we would need to use a stride
of exactly 2 on D type sources which but empirical evidence suggests that
"mul(8) acc0<1> g5<8,8,1>UD g9<16,8,2>UW" works just fine.


> > but the "fix" ended up breaking it because a MOV to the accumulator is
> > not the same as using it as a multiply destination due to the magic
> > way the 33/64 bits of the
>
> Technically it has 66 bits (it wasn't a typo when I said that to you
> earlier on IRC).  That's how it can t hold the result of a SIMD16
> 16x16-bit integer multiplication with 33-bit signed precision per scalar
> component.
>

Yes, there are 33 bits available for WxW multiplies but this is dealing
with a DxD multiply which only has 64 bits according to this bit of bspec
text:

As there are only 64 bits per channel in DWord mode (D and UD), it is
sufficient to store the multiplication result of two DWord operands as long
as the post source modified sources are still within 32 bits. If any one
source is type UD and is negated, the negated result becomes 33 bits. The
DWord multiplication result is then 65 bits, bigger than the storage
capacity of accumulators. Consequently, the results are unpredictable.


> > accumulator are handled for different instruction types.
> >
> > Fixes: efa4e4bc5fc "intel/fs: Introduce regioning lowering pass"
> > Cc: Francisco Jerez 
> > ---
> >  src/intel/compiler/brw_fs_lower_regioning.cpp | 16 +++-
> >  1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp
> b/src/intel/compiler/brw_fs_lower_regioning.cpp
> > index cc4163b4c2c..b8a89e82272 100644
> > --- a/src/intel/compiler/brw_fs_lower_regioning.cpp
> > +++ b/src/intel/compiler/brw_fs_lower_regioning.cpp
> > @@ -53,7 +53,13 @@ namespace {
> > unsigned
> > required_dst_byte_stride(const fs_inst *inst)
> > {
> > -  if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&
> > +  if (inst->dst.is_accumulator()) {
> > + /* Even though it's not explicitly documented in the PRMs or
> the
> > +  * BSpec, writes to the accumulator appear to not need any
> special
> > +  * treatment with respect too their destination stride
> alignment.
> > +  */
>
> The code is not really doing what the comment says.  The
> destination/source stride alignment restriction will still be honored
> for this instruction.  It's just that the destination *has* to be left
> untouched while doing that in the case of an integer MUL/MACH
> instruction (that's the only reason I asked you to return the original
> byte stride of the destination), because splitting off the region into a
> MOV would lead to data loss due to the inconsistent semantics of the
> accumulator destination for integer MUL/MACH (which update the whole 66
> bits) and every other integer arithmetic instruction (which update the
> bottom 33 bits and *apparently* leave the top 33 bits uninitialized) --
> IOW this is only here so that the assert below doesn't fire.
>

Ok, now I'm very confused.  It sounds to me like you 

Re: [Mesa-dev] [PATCH] intel/fs: Don't apply the des stride alignment rule to accumulators

2019-01-17 Thread Jason Ekstrand
On Thu, Jan 17, 2019 at 4:15 AM Francisco Jerez 
wrote:

> Jason Ekstrand  writes:
>
> > The pass was discovered to cause problems with the MUL+MACH combination
> > we emit for nir_[iu]mul_high.  In an experimental branch of mine, I ran
> > into issues where the MUL+MACH ended up using a strided source due to
> > working on half of a uint64_t and the new lowering pass helpfully tried
> > to fix the multiply which wrote to an unstriated accumulator.
>
> > Not only did the multiply not need to be fixed
>
> That's far from clear, and inconsistent with what this patch is doing,
> since the fix is still being applied (Wouldn't it make sense to clarify
> that in the commit message since it's slightly misleading about it?).
>
> The original instruction was technically violating the first CHV/BXT
> double-precision regioning restriction before the pass was introduced,
> that's why it made any changes in the first place.  The integer
> multiplication lowering code was just lucky enough that violating the
> restriction didn't matter in this case, but I doubt that the reason for
> that had anything to do with the accumulator being the explicit
> destination...
>

Explicit, no, but I do suspect that does have to do with it being the
accumulator.  This restriction isn't theoretical; if you violate it with a
GRF, you will get data corruption; I've seen it myself.  I could see two
possible explanations:

   However, accumulators are a special case that does appear to work.


> > but the "fix" ended up breaking it because a MOV to the accumulator is
> > not the same as using it as a multiply destination due to the magic
> > way the 33/64 bits of the
>
> Technically it has 66 bits (it wasn't a typo when I said that to you
> earlier on IRC).  That's how it can t hold the result of a SIMD16
> 16x16-bit integer multiplication with 33-bit signed precision per scalar
> component.
>

Yes, there are 33 bits available for WxW multiplies but this is dealing
with a DxD multiply which only has 64 bits according to this bit of bspec
text:

As there are only 64 bits per channel in DWord mode (D and UD), it is
sufficient to store the multiplication result of two DWord operands as long
as the post source modified sources are still within 32 bits. If any one
source is type UD and is negated, the negated result becomes 33 bits. The
DWord multiplication result is then 65 bits, bigger than the storage
capacity of accumulators. Consequently, the results are unpredictable.


> > accumulator are handled for different instruction types.
> >
> > Fixes: efa4e4bc5fc "intel/fs: Introduce regioning lowering pass"
> > Cc: Francisco Jerez 
> > ---
> >  src/intel/compiler/brw_fs_lower_regioning.cpp | 16 +++-
> >  1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp
> b/src/intel/compiler/brw_fs_lower_regioning.cpp
> > index cc4163b4c2c..b8a89e82272 100644
> > --- a/src/intel/compiler/brw_fs_lower_regioning.cpp
> > +++ b/src/intel/compiler/brw_fs_lower_regioning.cpp
> > @@ -53,7 +53,13 @@ namespace {
> > unsigned
> > required_dst_byte_stride(const fs_inst *inst)
> > {
> > -  if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&
> > +  if (inst->dst.is_accumulator()) {
> > + /* Even though it's not explicitly documented in the PRMs or
> the
> > +  * BSpec, writes to the accumulator appear to not need any
> special
> > +  * treatment with respect too their destination stride
> alignment.
> > +  */
>
> The code is not really doing what the comment says.  The
> destination/source stride alignment restriction will still be honored
> for this instruction.  It's just that the destination *has* to be left
> untouched while doing that in the case of an integer MUL/MACH
> instruction (that's the only reason I asked you to return the original
> byte stride of the destination), because splitting off the region into a
> MOV would lead to data loss due to the inconsistent semantics of the
> accumulator destination for integer MUL/MACH (which update the whole 66
> bits) and every other integer arithmetic instruction (which update the
> bottom 33 bits and *apparently* leave the top 33 bits uninitialized) --
> IOW this is only here so that the assert below doesn't fire.
>
> > + return inst->dst.stride * type_sz(inst->dst.type);
> > +  } else if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&
>
> The code changes themselves are just as I wished, so this gets my:
>
> Reviewed-by: Francisco Jerez 
>
> assuming that you clarify the commit message and comment above.
>
> >!is_byte_raw_mov(inst)) {
> >   return get_exec_type_size(inst);
> >} else {
> > @@ -316,6 +322,14 @@ namespace {
> > bool
> > lower_dst_region(fs_visitor *v, bblock_t *block, fs_inst *inst)
> > {
> > +  /* We cannot replace the result of an integer multiply which
> writes the
> > +   * accumulator because MUL+MACH 

Re: [Mesa-dev] [PATCH] intel/fs: Don't apply the des stride alignment rule to accumulators

2019-01-17 Thread Francisco Jerez
Jason Ekstrand  writes:

> The pass was discovered to cause problems with the MUL+MACH combination
> we emit for nir_[iu]mul_high.  In an experimental branch of mine, I ran
> into issues where the MUL+MACH ended up using a strided source due to
> working on half of a uint64_t and the new lowering pass helpfully tried
> to fix the multiply which wrote to an unstriated accumulator.

> Not only did the multiply not need to be fixed

That's far from clear, and inconsistent with what this patch is doing,
since the fix is still being applied (Wouldn't it make sense to clarify
that in the commit message since it's slightly misleading about it?).

The original instruction was technically violating the first CHV/BXT
double-precision regioning restriction before the pass was introduced,
that's why it made any changes in the first place.  The integer
multiplication lowering code was just lucky enough that violating the
restriction didn't matter in this case, but I doubt that the reason for
that had anything to do with the accumulator being the explicit
destination...

> but the "fix" ended up breaking it because a MOV to the accumulator is
> not the same as using it as a multiply destination due to the magic
> way the 33/64 bits of the

Technically it has 66 bits (it wasn't a typo when I said that to you
earlier on IRC).  That's how it can t hold the result of a SIMD16
16x16-bit integer multiplication with 33-bit signed precision per scalar
component.

> accumulator are handled for different instruction types.
>
> Fixes: efa4e4bc5fc "intel/fs: Introduce regioning lowering pass"
> Cc: Francisco Jerez 
> ---
>  src/intel/compiler/brw_fs_lower_regioning.cpp | 16 +++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp 
> b/src/intel/compiler/brw_fs_lower_regioning.cpp
> index cc4163b4c2c..b8a89e82272 100644
> --- a/src/intel/compiler/brw_fs_lower_regioning.cpp
> +++ b/src/intel/compiler/brw_fs_lower_regioning.cpp
> @@ -53,7 +53,13 @@ namespace {
> unsigned
> required_dst_byte_stride(const fs_inst *inst)
> {
> -  if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&
> +  if (inst->dst.is_accumulator()) {
> + /* Even though it's not explicitly documented in the PRMs or the
> +  * BSpec, writes to the accumulator appear to not need any special
> +  * treatment with respect too their destination stride alignment.
> +  */

The code is not really doing what the comment says.  The
destination/source stride alignment restriction will still be honored
for this instruction.  It's just that the destination *has* to be left
untouched while doing that in the case of an integer MUL/MACH
instruction (that's the only reason I asked you to return the original
byte stride of the destination), because splitting off the region into a
MOV would lead to data loss due to the inconsistent semantics of the
accumulator destination for integer MUL/MACH (which update the whole 66
bits) and every other integer arithmetic instruction (which update the
bottom 33 bits and *apparently* leave the top 33 bits uninitialized) --
IOW this is only here so that the assert below doesn't fire.

> + return inst->dst.stride * type_sz(inst->dst.type);
> +  } else if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&

The code changes themselves are just as I wished, so this gets my:

Reviewed-by: Francisco Jerez 

assuming that you clarify the commit message and comment above.

>!is_byte_raw_mov(inst)) {
>   return get_exec_type_size(inst);
>} else {
> @@ -316,6 +322,14 @@ namespace {
> bool
> lower_dst_region(fs_visitor *v, bblock_t *block, fs_inst *inst)
> {
> +  /* We cannot replace the result of an integer multiply which writes the
> +   * accumulator because MUL+MACH pairs act on the accumulator as a 
> 64-bit
> +   * value whereas the MOV will act on only 32 or 33 bits of the
> +   * accumulator.
> +   */
> +  assert(inst->opcode != BRW_OPCODE_MUL || !inst->dst.is_accumulator() ||
> + brw_reg_type_is_floating_point(inst->dst.type));
> +
>const fs_builder ibld(v, block, inst);
>const unsigned stride = required_dst_byte_stride(inst) /
>type_sz(inst->dst.type);
> -- 
> 2.20.1


signature.asc
Description: PGP signature
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] intel/fs: Don't apply the des stride alignment rule to accumulators

2019-01-16 Thread Matt Turner
On Wed, Jan 16, 2019 at 8:40 PM Jason Ekstrand  wrote:
>
> The pass was discovered to cause problems with the MUL+MACH combination
> we emit for nir_[iu]mul_high.  In an experimental branch of mine, I ran
> into issues where the MUL+MACH ended up using a strided source due to
> working on half of a uint64_t and the new lowering pass helpfully tried
> to fix the multiply which wrote to an unstriated accumulator.  Not only

unstrided

> did the multiply not need to be fixed but the "fix" ended up breaking it
> because a MOV to the accumulator is not the same as using it as a
> multiply destination due to the magic way the 33/64 bits of the
> accumulator are handled for different instruction types.
>
> Fixes: efa4e4bc5fc "intel/fs: Introduce regioning lowering pass"
> Cc: Francisco Jerez 
> ---
>  src/intel/compiler/brw_fs_lower_regioning.cpp | 16 +++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp 
> b/src/intel/compiler/brw_fs_lower_regioning.cpp
> index cc4163b4c2c..b8a89e82272 100644
> --- a/src/intel/compiler/brw_fs_lower_regioning.cpp
> +++ b/src/intel/compiler/brw_fs_lower_regioning.cpp
> @@ -53,7 +53,13 @@ namespace {
> unsigned
> required_dst_byte_stride(const fs_inst *inst)
> {
> -  if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&
> +  if (inst->dst.is_accumulator()) {
> + /* Even though it's not explicitly documented in the PRMs or the
> +  * BSpec, writes to the accumulator appear to not need any special
> +  * treatment with respect too their destination stride alignment.

to


I read the backlog on IRC and I still don't understand, but that's okay :)
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


[Mesa-dev] [PATCH] intel/fs: Don't apply the des stride alignment rule to accumulators

2019-01-16 Thread Jason Ekstrand
The pass was discovered to cause problems with the MUL+MACH combination
we emit for nir_[iu]mul_high.  In an experimental branch of mine, I ran
into issues where the MUL+MACH ended up using a strided source due to
working on half of a uint64_t and the new lowering pass helpfully tried
to fix the multiply which wrote to an unstriated accumulator.  Not only
did the multiply not need to be fixed but the "fix" ended up breaking it
because a MOV to the accumulator is not the same as using it as a
multiply destination due to the magic way the 33/64 bits of the
accumulator are handled for different instruction types.

Fixes: efa4e4bc5fc "intel/fs: Introduce regioning lowering pass"
Cc: Francisco Jerez 
---
 src/intel/compiler/brw_fs_lower_regioning.cpp | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/src/intel/compiler/brw_fs_lower_regioning.cpp 
b/src/intel/compiler/brw_fs_lower_regioning.cpp
index cc4163b4c2c..b8a89e82272 100644
--- a/src/intel/compiler/brw_fs_lower_regioning.cpp
+++ b/src/intel/compiler/brw_fs_lower_regioning.cpp
@@ -53,7 +53,13 @@ namespace {
unsigned
required_dst_byte_stride(const fs_inst *inst)
{
-  if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&
+  if (inst->dst.is_accumulator()) {
+ /* Even though it's not explicitly documented in the PRMs or the
+  * BSpec, writes to the accumulator appear to not need any special
+  * treatment with respect too their destination stride alignment.
+  */
+ return inst->dst.stride * type_sz(inst->dst.type);
+  } else if (type_sz(inst->dst.type) < get_exec_type_size(inst) &&
   !is_byte_raw_mov(inst)) {
  return get_exec_type_size(inst);
   } else {
@@ -316,6 +322,14 @@ namespace {
bool
lower_dst_region(fs_visitor *v, bblock_t *block, fs_inst *inst)
{
+  /* We cannot replace the result of an integer multiply which writes the
+   * accumulator because MUL+MACH pairs act on the accumulator as a 64-bit
+   * value whereas the MOV will act on only 32 or 33 bits of the
+   * accumulator.
+   */
+  assert(inst->opcode != BRW_OPCODE_MUL || !inst->dst.is_accumulator() ||
+ brw_reg_type_is_floating_point(inst->dst.type));
+
   const fs_builder ibld(v, block, inst);
   const unsigned stride = required_dst_byte_stride(inst) /
   type_sz(inst->dst.type);
-- 
2.20.1

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev