Re: [Mesa-dev] [PATCH 41/59] intel/compiler: split is_partial_write() into two variants

2018-12-14 Thread Pohjolainen, Topi
On Fri, Dec 14, 2018 at 08:47:23AM +0100, Iago Toral wrote:
> On Thu, 2018-12-13 at 12:49 +0200, Pohjolainen, Topi wrote:
> > On Thu, Dec 13, 2018 at 09:10:24AM +0100, Iago Toral wrote:
> > > On Wed, 2018-12-12 at 14:15 +0200, Pohjolainen, Topi wrote:
> > > > On Wed, Dec 12, 2018 at 09:48:20AM +0100, Iago Toral wrote:
> > > > > On Tue, 2018-12-11 at 18:59 +0200, Pohjolainen, Topi wrote:
> > > > > > On Fri, Dec 07, 2018 at 03:30:11PM +0200, Pohjolainen, Topi
> > > > > > wrote:
> > > > > > > On Tue, Dec 04, 2018 at 08:17:05AM +0100, Iago Toral
> > > > > > > Quiroga
> > > > > > > wrote:
> > > > > > > > This function is used in two different scenarios that for
> > > > > > > > 32-
> > > > > > > > bit
> > > > > > > > instructions are the same, but for 16-bit instructions
> > > > > > > > are
> > > > > > > > not.
> > > > > > > > 
> > > > > > > > One scenario is that in which we are working at a SIMD8
> > > > > > > > register
> > > > > > > > level and we need to know if a register is fully defined
> > > > > > > > or
> > > > > > > > written.
> > > > > > > > This is useful, for example, in the context of liveness
> > > > > > > > analysis
> > > > > > > > or
> > > > > > > > register allocation, where we work with units of
> > > > > > > > registers.
> > > > > > > > 
> > > > > > > > The other scenario is that in which we want to know if an
> > > > > > > > instruction
> > > > > > > > is writing a full scalar component or just some subset of
> > > > > > > > it.
> > > > > > > > This is
> > > > > > > > useful, for example, in the context of some optimization
> > > > > > > > passes
> > > > > > > > like copy propagation.
> > > > > > > > 
> > > > > > > > For 32-bit instructions (or larger), a SIMD8 dispatch
> > > > > > > > will
> > > > > > > > always
> > > > > > > > write
> > > > > > > > at least a full SIMD8 register (32B) if the write is not
> > > > > > > > partial.
> > > > > > > > The
> > > > > > > > function is_partial_write() checks this to determine if
> > > > > > > > we
> > > > > > > > have a
> > > > > > > > partial
> > > > > > > > write. However, when we deal with 16-bit instructions,
> > > > > > > > that
> > > > > > > > logic
> > > > > > > > disables
> > > > > > > > some optimizations that should be safe. For example, a
> > > > > > > > SIMD8
> > > > > > > > 16-
> > > > > > > > bit MOV will
> > > > > > > > only update half of a SIMD register, but it is still a
> > > > > > > > complete
> > > > > > > > write of the
> > > > > > > > variable for a SIMD8 dispatch, so we should not prevent
> > > > > > > > copy
> > > > > > > > propagation in
> > > > > > > > this scenario because we don't write all 32 bytes in the
> > > > > > > > SIMD
> > > > > > > > register
> > > > > > > > or because the write starts at offset 16B (wehere we pack
> > > > > > > > components Y or
> > > > > > > > W of 16-bit vectors).
> > > > > > > > 
> > > > > > > > This is a problem for SIMD8 executions (VS, TCS, TES, GS)
> > > > > > > > of
> > > > > > > > 16-
> > > > > > > > bit
> > > > > > > > instructions, which lose a number of optimizations
> > > > > > > > because of
> > > > > > > > this, most
> > > > > > > > important of which is copy-propagation.
> > > > > > > > 
> > > > > > > > This patch splits is_partial_write() into
> > > > > > > > is_partial_reg_write(),
> > > > > > > > which
> > > > > > > > represents the current is_partial_write(), useful for
> > > > > > > > things
> > > > > > > > like
> > > > > > > > liveness analysis, and is_partial_var_write(), which
> > > > > > > > considers
> > > > > > > > the dispatch size to check if we are writing a full
> > > > > > > > variable
> > > > > > > > (rather
> > > > > > > > than a full register) to decide if the write is partial
> > > > > > > > or
> > > > > > > > not,
> > > > > > > > which
> > > > > > > > is what we really want in many optimization passes.
> > > > > > 
> > > > > > I actually started wondering why would liveness analysis and
> > > > > > register
> > > > > > coalescing need to treat the 16-bit SIMD8 case differently
> > > > > > than
> > > > > > optimizations.
> > > > > > In virtual register space nothing would read or write the
> > > > > > unused
> > > > > > second half
> > > > > > of the register in case of 16-bit type and SIMD8.
> > > > > 
> > > > > True, we might be able to use the "variable" version in more
> > > > > cases.
> > > > > I
> > > > > was trying to be conservative when I implemented this because I
> > > > > don't
> > > > > think the half-float CTS tests provides a good testing ground
> > > > > for
> > > > > all
> > > > > aspects of the compiler. I can try that and see if it breaks
> > > > > anything
> > > > > though.
> > > > >  
> > > > > > Real register allocation in turn should be orthogonal to how
> > > > > > things
> > > > > > are
> > > > > > allocated in virtual space. And I guess even there we
> > > > > > wouldn't be
> > > > > > interested
> > > > > > of packing two 16-bit typed SIMD8 variables into one and same
> > > > > > hardware
> > > > > > register. It is SIMD16 where 

Re: [Mesa-dev] [PATCH 41/59] intel/compiler: split is_partial_write() into two variants

2018-12-13 Thread Iago Toral
On Thu, 2018-12-13 at 12:49 +0200, Pohjolainen, Topi wrote:
> On Thu, Dec 13, 2018 at 09:10:24AM +0100, Iago Toral wrote:
> > On Wed, 2018-12-12 at 14:15 +0200, Pohjolainen, Topi wrote:
> > > On Wed, Dec 12, 2018 at 09:48:20AM +0100, Iago Toral wrote:
> > > > On Tue, 2018-12-11 at 18:59 +0200, Pohjolainen, Topi wrote:
> > > > > On Fri, Dec 07, 2018 at 03:30:11PM +0200, Pohjolainen, Topi
> > > > > wrote:
> > > > > > On Tue, Dec 04, 2018 at 08:17:05AM +0100, Iago Toral
> > > > > > Quiroga
> > > > > > wrote:
> > > > > > > This function is used in two different scenarios that for
> > > > > > > 32-
> > > > > > > bit
> > > > > > > instructions are the same, but for 16-bit instructions
> > > > > > > are
> > > > > > > not.
> > > > > > > 
> > > > > > > One scenario is that in which we are working at a SIMD8
> > > > > > > register
> > > > > > > level and we need to know if a register is fully defined
> > > > > > > or
> > > > > > > written.
> > > > > > > This is useful, for example, in the context of liveness
> > > > > > > analysis
> > > > > > > or
> > > > > > > register allocation, where we work with units of
> > > > > > > registers.
> > > > > > > 
> > > > > > > The other scenario is that in which we want to know if an
> > > > > > > instruction
> > > > > > > is writing a full scalar component or just some subset of
> > > > > > > it.
> > > > > > > This is
> > > > > > > useful, for example, in the context of some optimization
> > > > > > > passes
> > > > > > > like copy propagation.
> > > > > > > 
> > > > > > > For 32-bit instructions (or larger), a SIMD8 dispatch
> > > > > > > will
> > > > > > > always
> > > > > > > write
> > > > > > > at least a full SIMD8 register (32B) if the write is not
> > > > > > > partial.
> > > > > > > The
> > > > > > > function is_partial_write() checks this to determine if
> > > > > > > we
> > > > > > > have a
> > > > > > > partial
> > > > > > > write. However, when we deal with 16-bit instructions,
> > > > > > > that
> > > > > > > logic
> > > > > > > disables
> > > > > > > some optimizations that should be safe. For example, a
> > > > > > > SIMD8
> > > > > > > 16-
> > > > > > > bit MOV will
> > > > > > > only update half of a SIMD register, but it is still a
> > > > > > > complete
> > > > > > > write of the
> > > > > > > variable for a SIMD8 dispatch, so we should not prevent
> > > > > > > copy
> > > > > > > propagation in
> > > > > > > this scenario because we don't write all 32 bytes in the
> > > > > > > SIMD
> > > > > > > register
> > > > > > > or because the write starts at offset 16B (wehere we pack
> > > > > > > components Y or
> > > > > > > W of 16-bit vectors).
> > > > > > > 
> > > > > > > This is a problem for SIMD8 executions (VS, TCS, TES, GS)
> > > > > > > of
> > > > > > > 16-
> > > > > > > bit
> > > > > > > instructions, which lose a number of optimizations
> > > > > > > because of
> > > > > > > this, most
> > > > > > > important of which is copy-propagation.
> > > > > > > 
> > > > > > > This patch splits is_partial_write() into
> > > > > > > is_partial_reg_write(),
> > > > > > > which
> > > > > > > represents the current is_partial_write(), useful for
> > > > > > > things
> > > > > > > like
> > > > > > > liveness analysis, and is_partial_var_write(), which
> > > > > > > considers
> > > > > > > the dispatch size to check if we are writing a full
> > > > > > > variable
> > > > > > > (rather
> > > > > > > than a full register) to decide if the write is partial
> > > > > > > or
> > > > > > > not,
> > > > > > > which
> > > > > > > is what we really want in many optimization passes.
> > > > > 
> > > > > I actually started wondering why would liveness analysis and
> > > > > register
> > > > > coalescing need to treat the 16-bit SIMD8 case differently
> > > > > than
> > > > > optimizations.
> > > > > In virtual register space nothing would read or write the
> > > > > unused
> > > > > second half
> > > > > of the register in case of 16-bit type and SIMD8.
> > > > 
> > > > True, we might be able to use the "variable" version in more
> > > > cases.
> > > > I
> > > > was trying to be conservative when I implemented this because I
> > > > don't
> > > > think the half-float CTS tests provides a good testing ground
> > > > for
> > > > all
> > > > aspects of the compiler. I can try that and see if it breaks
> > > > anything
> > > > though.
> > > >  
> > > > > Real register allocation in turn should be orthogonal to how
> > > > > things
> > > > > are
> > > > > allocated in virtual space. And I guess even there we
> > > > > wouldn't be
> > > > > interested
> > > > > of packing two 16-bit typed SIMD8 variables into one and same
> > > > > hardware
> > > > > register. It is SIMD16 where we get more pressure into
> > > > > register
> > > > > space
> > > > > anyway
> > > > > and there the 16-bit typed variables occupy full registers.
> > > > > In
> > > > > other
> > > > > words,
> > > > > if things fit in SIMD16, would we bother packing things more
> > > > > tightly
> > > > > in
> > > > > SIMD8? 

Re: [Mesa-dev] [PATCH 41/59] intel/compiler: split is_partial_write() into two variants

2018-12-13 Thread Chema Casanova


El 13/12/18 a las 11:49, Pohjolainen, Topi escribió:
> On Thu, Dec 13, 2018 at 09:10:24AM +0100, Iago Toral wrote:
>> On Wed, 2018-12-12 at 14:15 +0200, Pohjolainen, Topi wrote:
>>> On Wed, Dec 12, 2018 at 09:48:20AM +0100, Iago Toral wrote:
 On Tue, 2018-12-11 at 18:59 +0200, Pohjolainen, Topi wrote:
> On Fri, Dec 07, 2018 at 03:30:11PM +0200, Pohjolainen, Topi
> wrote:
>> On Tue, Dec 04, 2018 at 08:17:05AM +0100, Iago Toral Quiroga
>> wrote:
>>> This function is used in two different scenarios that for 32-
>>> bit
>>> instructions are the same, but for 16-bit instructions are
>>> not.
>>>
>>> One scenario is that in which we are working at a SIMD8
>>> register
>>> level and we need to know if a register is fully defined or
>>> written.
>>> This is useful, for example, in the context of liveness
>>> analysis
>>> or
>>> register allocation, where we work with units of registers.
>>>
>>> The other scenario is that in which we want to know if an
>>> instruction
>>> is writing a full scalar component or just some subset of it.
>>> This is
>>> useful, for example, in the context of some optimization
>>> passes
>>> like copy propagation.
>>>
>>> For 32-bit instructions (or larger), a SIMD8 dispatch will
>>> always
>>> write
>>> at least a full SIMD8 register (32B) if the write is not
>>> partial.
>>> The
>>> function is_partial_write() checks this to determine if we
>>> have a
>>> partial
>>> write. However, when we deal with 16-bit instructions, that
>>> logic
>>> disables
>>> some optimizations that should be safe. For example, a SIMD8
>>> 16-
>>> bit MOV will
>>> only update half of a SIMD register, but it is still a
>>> complete
>>> write of the
>>> variable for a SIMD8 dispatch, so we should not prevent copy
>>> propagation in
>>> this scenario because we don't write all 32 bytes in the SIMD
>>> register
>>> or because the write starts at offset 16B (wehere we pack
>>> components Y or
>>> W of 16-bit vectors).
>>>
>>> This is a problem for SIMD8 executions (VS, TCS, TES, GS) of
>>> 16-
>>> bit
>>> instructions, which lose a number of optimizations because of
>>> this, most
>>> important of which is copy-propagation.
>>>
>>> This patch splits is_partial_write() into
>>> is_partial_reg_write(),
>>> which
>>> represents the current is_partial_write(), useful for things
>>> like
>>> liveness analysis, and is_partial_var_write(), which
>>> considers
>>> the dispatch size to check if we are writing a full variable
>>> (rather
>>> than a full register) to decide if the write is partial or
>>> not,
>>> which
>>> is what we really want in many optimization passes.
>
> I actually started wondering why would liveness analysis and
> register
> coalescing need to treat the 16-bit SIMD8 case differently than
> optimizations.
> In virtual register space nothing would read or write the unused
> second half
> of the register in case of 16-bit type and SIMD8.

 True, we might be able to use the "variable" version in more cases.
 I
 was trying to be conservative when I implemented this because I
 don't
 think the half-float CTS tests provides a good testing ground for
 all
 aspects of the compiler. I can try that and see if it breaks
 anything
 though.
  
> Real register allocation in turn should be orthogonal to how
> things
> are
> allocated in virtual space. And I guess even there we wouldn't be
> interested
> of packing two 16-bit typed SIMD8 variables into one and same
> hardware
> register. It is SIMD16 where we get more pressure into register
> space
> anyway
> and there the 16-bit typed variables occupy full registers. In
> other
> words,
> if things fit in SIMD16, would we bother packing things more
> tightly
> in
> SIMD8? Or even if SIMD8 was the only option, would we be
> interested
> packing
> channels for two variables in one hw reg even then?
>
> Jason, we discussed this a little in the spring time.
>
> As a recap my approach shortly. Instead of ignoring the second
> half
> of
> registers case by case I addressed it more generally:
>
> - changed all the open coded checks to use helpers,
> - added a padding bit into fs_reg telling about the unused space,
> - change nir -> fs step to set that bit for 16-bit typed regs
> - and finally changed the helpers to consider the padding bit.

 So if I understand how this works, you mostly make the vgrf
 infrastructure think that half-float registers actually use twice
 the
 space they require by including the padding into the
 component_size()
 helper, 

Re: [Mesa-dev] [PATCH 41/59] intel/compiler: split is_partial_write() into two variants

2018-12-13 Thread Pohjolainen, Topi
On Thu, Dec 13, 2018 at 09:10:24AM +0100, Iago Toral wrote:
> On Wed, 2018-12-12 at 14:15 +0200, Pohjolainen, Topi wrote:
> > On Wed, Dec 12, 2018 at 09:48:20AM +0100, Iago Toral wrote:
> > > On Tue, 2018-12-11 at 18:59 +0200, Pohjolainen, Topi wrote:
> > > > On Fri, Dec 07, 2018 at 03:30:11PM +0200, Pohjolainen, Topi
> > > > wrote:
> > > > > On Tue, Dec 04, 2018 at 08:17:05AM +0100, Iago Toral Quiroga
> > > > > wrote:
> > > > > > This function is used in two different scenarios that for 32-
> > > > > > bit
> > > > > > instructions are the same, but for 16-bit instructions are
> > > > > > not.
> > > > > > 
> > > > > > One scenario is that in which we are working at a SIMD8
> > > > > > register
> > > > > > level and we need to know if a register is fully defined or
> > > > > > written.
> > > > > > This is useful, for example, in the context of liveness
> > > > > > analysis
> > > > > > or
> > > > > > register allocation, where we work with units of registers.
> > > > > > 
> > > > > > The other scenario is that in which we want to know if an
> > > > > > instruction
> > > > > > is writing a full scalar component or just some subset of it.
> > > > > > This is
> > > > > > useful, for example, in the context of some optimization
> > > > > > passes
> > > > > > like copy propagation.
> > > > > > 
> > > > > > For 32-bit instructions (or larger), a SIMD8 dispatch will
> > > > > > always
> > > > > > write
> > > > > > at least a full SIMD8 register (32B) if the write is not
> > > > > > partial.
> > > > > > The
> > > > > > function is_partial_write() checks this to determine if we
> > > > > > have a
> > > > > > partial
> > > > > > write. However, when we deal with 16-bit instructions, that
> > > > > > logic
> > > > > > disables
> > > > > > some optimizations that should be safe. For example, a SIMD8
> > > > > > 16-
> > > > > > bit MOV will
> > > > > > only update half of a SIMD register, but it is still a
> > > > > > complete
> > > > > > write of the
> > > > > > variable for a SIMD8 dispatch, so we should not prevent copy
> > > > > > propagation in
> > > > > > this scenario because we don't write all 32 bytes in the SIMD
> > > > > > register
> > > > > > or because the write starts at offset 16B (wehere we pack
> > > > > > components Y or
> > > > > > W of 16-bit vectors).
> > > > > > 
> > > > > > This is a problem for SIMD8 executions (VS, TCS, TES, GS) of
> > > > > > 16-
> > > > > > bit
> > > > > > instructions, which lose a number of optimizations because of
> > > > > > this, most
> > > > > > important of which is copy-propagation.
> > > > > > 
> > > > > > This patch splits is_partial_write() into
> > > > > > is_partial_reg_write(),
> > > > > > which
> > > > > > represents the current is_partial_write(), useful for things
> > > > > > like
> > > > > > liveness analysis, and is_partial_var_write(), which
> > > > > > considers
> > > > > > the dispatch size to check if we are writing a full variable
> > > > > > (rather
> > > > > > than a full register) to decide if the write is partial or
> > > > > > not,
> > > > > > which
> > > > > > is what we really want in many optimization passes.
> > > > 
> > > > I actually started wondering why would liveness analysis and
> > > > register
> > > > coalescing need to treat the 16-bit SIMD8 case differently than
> > > > optimizations.
> > > > In virtual register space nothing would read or write the unused
> > > > second half
> > > > of the register in case of 16-bit type and SIMD8.
> > > 
> > > True, we might be able to use the "variable" version in more cases.
> > > I
> > > was trying to be conservative when I implemented this because I
> > > don't
> > > think the half-float CTS tests provides a good testing ground for
> > > all
> > > aspects of the compiler. I can try that and see if it breaks
> > > anything
> > > though.
> > >  
> > > > Real register allocation in turn should be orthogonal to how
> > > > things
> > > > are
> > > > allocated in virtual space. And I guess even there we wouldn't be
> > > > interested
> > > > of packing two 16-bit typed SIMD8 variables into one and same
> > > > hardware
> > > > register. It is SIMD16 where we get more pressure into register
> > > > space
> > > > anyway
> > > > and there the 16-bit typed variables occupy full registers. In
> > > > other
> > > > words,
> > > > if things fit in SIMD16, would we bother packing things more
> > > > tightly
> > > > in
> > > > SIMD8? Or even if SIMD8 was the only option, would we be
> > > > interested
> > > > packing
> > > > channels for two variables in one hw reg even then?
> > > > 
> > > > Jason, we discussed this a little in the spring time.
> > > > 
> > > > As a recap my approach shortly. Instead of ignoring the second
> > > > half
> > > > of
> > > > registers case by case I addressed it more generally:
> > > > 
> > > > - changed all the open coded checks to use helpers,
> > > > - added a padding bit into fs_reg telling about the unused space,
> > > > - change nir -> fs step to set that bit for 16-bit 

Re: [Mesa-dev] [PATCH 41/59] intel/compiler: split is_partial_write() into two variants

2018-12-13 Thread Iago Toral
On Wed, 2018-12-12 at 14:15 +0200, Pohjolainen, Topi wrote:
> On Wed, Dec 12, 2018 at 09:48:20AM +0100, Iago Toral wrote:
> > On Tue, 2018-12-11 at 18:59 +0200, Pohjolainen, Topi wrote:
> > > On Fri, Dec 07, 2018 at 03:30:11PM +0200, Pohjolainen, Topi
> > > wrote:
> > > > On Tue, Dec 04, 2018 at 08:17:05AM +0100, Iago Toral Quiroga
> > > > wrote:
> > > > > This function is used in two different scenarios that for 32-
> > > > > bit
> > > > > instructions are the same, but for 16-bit instructions are
> > > > > not.
> > > > > 
> > > > > One scenario is that in which we are working at a SIMD8
> > > > > register
> > > > > level and we need to know if a register is fully defined or
> > > > > written.
> > > > > This is useful, for example, in the context of liveness
> > > > > analysis
> > > > > or
> > > > > register allocation, where we work with units of registers.
> > > > > 
> > > > > The other scenario is that in which we want to know if an
> > > > > instruction
> > > > > is writing a full scalar component or just some subset of it.
> > > > > This is
> > > > > useful, for example, in the context of some optimization
> > > > > passes
> > > > > like copy propagation.
> > > > > 
> > > > > For 32-bit instructions (or larger), a SIMD8 dispatch will
> > > > > always
> > > > > write
> > > > > at least a full SIMD8 register (32B) if the write is not
> > > > > partial.
> > > > > The
> > > > > function is_partial_write() checks this to determine if we
> > > > > have a
> > > > > partial
> > > > > write. However, when we deal with 16-bit instructions, that
> > > > > logic
> > > > > disables
> > > > > some optimizations that should be safe. For example, a SIMD8
> > > > > 16-
> > > > > bit MOV will
> > > > > only update half of a SIMD register, but it is still a
> > > > > complete
> > > > > write of the
> > > > > variable for a SIMD8 dispatch, so we should not prevent copy
> > > > > propagation in
> > > > > this scenario because we don't write all 32 bytes in the SIMD
> > > > > register
> > > > > or because the write starts at offset 16B (wehere we pack
> > > > > components Y or
> > > > > W of 16-bit vectors).
> > > > > 
> > > > > This is a problem for SIMD8 executions (VS, TCS, TES, GS) of
> > > > > 16-
> > > > > bit
> > > > > instructions, which lose a number of optimizations because of
> > > > > this, most
> > > > > important of which is copy-propagation.
> > > > > 
> > > > > This patch splits is_partial_write() into
> > > > > is_partial_reg_write(),
> > > > > which
> > > > > represents the current is_partial_write(), useful for things
> > > > > like
> > > > > liveness analysis, and is_partial_var_write(), which
> > > > > considers
> > > > > the dispatch size to check if we are writing a full variable
> > > > > (rather
> > > > > than a full register) to decide if the write is partial or
> > > > > not,
> > > > > which
> > > > > is what we really want in many optimization passes.
> > > 
> > > I actually started wondering why would liveness analysis and
> > > register
> > > coalescing need to treat the 16-bit SIMD8 case differently than
> > > optimizations.
> > > In virtual register space nothing would read or write the unused
> > > second half
> > > of the register in case of 16-bit type and SIMD8.
> > 
> > True, we might be able to use the "variable" version in more cases.
> > I
> > was trying to be conservative when I implemented this because I
> > don't
> > think the half-float CTS tests provides a good testing ground for
> > all
> > aspects of the compiler. I can try that and see if it breaks
> > anything
> > though.
> >  
> > > Real register allocation in turn should be orthogonal to how
> > > things
> > > are
> > > allocated in virtual space. And I guess even there we wouldn't be
> > > interested
> > > of packing two 16-bit typed SIMD8 variables into one and same
> > > hardware
> > > register. It is SIMD16 where we get more pressure into register
> > > space
> > > anyway
> > > and there the 16-bit typed variables occupy full registers. In
> > > other
> > > words,
> > > if things fit in SIMD16, would we bother packing things more
> > > tightly
> > > in
> > > SIMD8? Or even if SIMD8 was the only option, would we be
> > > interested
> > > packing
> > > channels for two variables in one hw reg even then?
> > > 
> > > Jason, we discussed this a little in the spring time.
> > > 
> > > As a recap my approach shortly. Instead of ignoring the second
> > > half
> > > of
> > > registers case by case I addressed it more generally:
> > > 
> > > - changed all the open coded checks to use helpers,
> > > - added a padding bit into fs_reg telling about the unused space,
> > > - change nir -> fs step to set that bit for 16-bit typed regs
> > > - and finally changed the helpers to consider the padding bit.
> > 
> > So if I understand how this works, you mostly make the vgrf
> > infrastructure think that half-float registers actually use twice
> > the
> > space they require by including the padding into the
> > component_size()
> > 

Re: [Mesa-dev] [PATCH 41/59] intel/compiler: split is_partial_write() into two variants

2018-12-12 Thread Pohjolainen, Topi
On Wed, Dec 12, 2018 at 09:48:20AM +0100, Iago Toral wrote:
> On Tue, 2018-12-11 at 18:59 +0200, Pohjolainen, Topi wrote:
> > On Fri, Dec 07, 2018 at 03:30:11PM +0200, Pohjolainen, Topi wrote:
> > > On Tue, Dec 04, 2018 at 08:17:05AM +0100, Iago Toral Quiroga wrote:
> > > > This function is used in two different scenarios that for 32-bit
> > > > instructions are the same, but for 16-bit instructions are not.
> > > > 
> > > > One scenario is that in which we are working at a SIMD8 register
> > > > level and we need to know if a register is fully defined or
> > > > written.
> > > > This is useful, for example, in the context of liveness analysis
> > > > or
> > > > register allocation, where we work with units of registers.
> > > > 
> > > > The other scenario is that in which we want to know if an
> > > > instruction
> > > > is writing a full scalar component or just some subset of it.
> > > > This is
> > > > useful, for example, in the context of some optimization passes
> > > > like copy propagation.
> > > > 
> > > > For 32-bit instructions (or larger), a SIMD8 dispatch will always
> > > > write
> > > > at least a full SIMD8 register (32B) if the write is not partial.
> > > > The
> > > > function is_partial_write() checks this to determine if we have a
> > > > partial
> > > > write. However, when we deal with 16-bit instructions, that logic
> > > > disables
> > > > some optimizations that should be safe. For example, a SIMD8 16-
> > > > bit MOV will
> > > > only update half of a SIMD register, but it is still a complete
> > > > write of the
> > > > variable for a SIMD8 dispatch, so we should not prevent copy
> > > > propagation in
> > > > this scenario because we don't write all 32 bytes in the SIMD
> > > > register
> > > > or because the write starts at offset 16B (wehere we pack
> > > > components Y or
> > > > W of 16-bit vectors).
> > > > 
> > > > This is a problem for SIMD8 executions (VS, TCS, TES, GS) of 16-
> > > > bit
> > > > instructions, which lose a number of optimizations because of
> > > > this, most
> > > > important of which is copy-propagation.
> > > > 
> > > > This patch splits is_partial_write() into is_partial_reg_write(),
> > > > which
> > > > represents the current is_partial_write(), useful for things like
> > > > liveness analysis, and is_partial_var_write(), which considers
> > > > the dispatch size to check if we are writing a full variable
> > > > (rather
> > > > than a full register) to decide if the write is partial or not,
> > > > which
> > > > is what we really want in many optimization passes.
> > 
> > I actually started wondering why would liveness analysis and register
> > coalescing need to treat the 16-bit SIMD8 case differently than
> > optimizations.
> > In virtual register space nothing would read or write the unused
> > second half
> > of the register in case of 16-bit type and SIMD8.
> 
> True, we might be able to use the "variable" version in more cases. I
> was trying to be conservative when I implemented this because I don't
> think the half-float CTS tests provides a good testing ground for all
> aspects of the compiler. I can try that and see if it breaks anything
> though.
>  
> > Real register allocation in turn should be orthogonal to how things
> > are
> > allocated in virtual space. And I guess even there we wouldn't be
> > interested
> > of packing two 16-bit typed SIMD8 variables into one and same
> > hardware
> > register. It is SIMD16 where we get more pressure into register space
> > anyway
> > and there the 16-bit typed variables occupy full registers. In other
> > words,
> > if things fit in SIMD16, would we bother packing things more tightly
> > in
> > SIMD8? Or even if SIMD8 was the only option, would we be interested
> > packing
> > channels for two variables in one hw reg even then?
> > 
> > Jason, we discussed this a little in the spring time.
> > 
> > As a recap my approach shortly. Instead of ignoring the second half
> > of
> > registers case by case I addressed it more generally:
> > 
> > - changed all the open coded checks to use helpers,
> > - added a padding bit into fs_reg telling about the unused space,
> > - change nir -> fs step to set that bit for 16-bit typed regs
> > - and finally changed the helpers to consider the padding bit.
> 
> So if I understand how this works, you mostly make the vgrf
> infrastructure think that half-float registers actually use twice the
> space they require by including the padding into the component_size()
> helper, right? (We also need to cover 8-bit by the way, but I suppose
> it just means increasing the padding we use for 8-bit variables). And
> then you modify is_partial_write() to use that helper so it accounts
> for that padding and just check if the (padded) size is less than a
> full SIMD register. Basically, we are making all smaller types look
> like they use 32-bit per channel in the vgrf space.

Actually not 32-bit per channel but that there is unused data towards the
end of the register, 

Re: [Mesa-dev] [PATCH 41/59] intel/compiler: split is_partial_write() into two variants

2018-12-12 Thread Iago Toral
On Tue, 2018-12-11 at 18:59 +0200, Pohjolainen, Topi wrote:
> On Fri, Dec 07, 2018 at 03:30:11PM +0200, Pohjolainen, Topi wrote:
> > On Tue, Dec 04, 2018 at 08:17:05AM +0100, Iago Toral Quiroga wrote:
> > > This function is used in two different scenarios that for 32-bit
> > > instructions are the same, but for 16-bit instructions are not.
> > > 
> > > One scenario is that in which we are working at a SIMD8 register
> > > level and we need to know if a register is fully defined or
> > > written.
> > > This is useful, for example, in the context of liveness analysis
> > > or
> > > register allocation, where we work with units of registers.
> > > 
> > > The other scenario is that in which we want to know if an
> > > instruction
> > > is writing a full scalar component or just some subset of it.
> > > This is
> > > useful, for example, in the context of some optimization passes
> > > like copy propagation.
> > > 
> > > For 32-bit instructions (or larger), a SIMD8 dispatch will always
> > > write
> > > at least a full SIMD8 register (32B) if the write is not partial.
> > > The
> > > function is_partial_write() checks this to determine if we have a
> > > partial
> > > write. However, when we deal with 16-bit instructions, that logic
> > > disables
> > > some optimizations that should be safe. For example, a SIMD8 16-
> > > bit MOV will
> > > only update half of a SIMD register, but it is still a complete
> > > write of the
> > > variable for a SIMD8 dispatch, so we should not prevent copy
> > > propagation in
> > > this scenario because we don't write all 32 bytes in the SIMD
> > > register
> > > or because the write starts at offset 16B (wehere we pack
> > > components Y or
> > > W of 16-bit vectors).
> > > 
> > > This is a problem for SIMD8 executions (VS, TCS, TES, GS) of 16-
> > > bit
> > > instructions, which lose a number of optimizations because of
> > > this, most
> > > important of which is copy-propagation.
> > > 
> > > This patch splits is_partial_write() into is_partial_reg_write(),
> > > which
> > > represents the current is_partial_write(), useful for things like
> > > liveness analysis, and is_partial_var_write(), which considers
> > > the dispatch size to check if we are writing a full variable
> > > (rather
> > > than a full register) to decide if the write is partial or not,
> > > which
> > > is what we really want in many optimization passes.
> 
> I actually started wondering why would liveness analysis and register
> coalescing need to treat the 16-bit SIMD8 case differently than
> optimizations.
> In virtual register space nothing would read or write the unused
> second half
> of the register in case of 16-bit type and SIMD8.

True, we might be able to use the "variable" version in more cases. I
was trying to be conservative when I implemented this because I don't
think the half-float CTS tests provides a good testing ground for all
aspects of the compiler. I can try that and see if it breaks anything
though.
 
> Real register allocation in turn should be orthogonal to how things
> are
> allocated in virtual space. And I guess even there we wouldn't be
> interested
> of packing two 16-bit typed SIMD8 variables into one and same
> hardware
> register. It is SIMD16 where we get more pressure into register space
> anyway
> and there the 16-bit typed variables occupy full registers. In other
> words,
> if things fit in SIMD16, would we bother packing things more tightly
> in
> SIMD8? Or even if SIMD8 was the only option, would we be interested
> packing
> channels for two variables in one hw reg even then?
> 
> Jason, we discussed this a little in the spring time.
> 
> As a recap my approach shortly. Instead of ignoring the second half
> of
> registers case by case I addressed it more generally:
> 
> - changed all the open coded checks to use helpers,
> - added a padding bit into fs_reg telling about the unused space,
> - change nir -> fs step to set that bit for 16-bit typed regs
> - and finally changed the helpers to consider the padding bit.

So if I understand how this works, you mostly make the vgrf
infrastructure think that half-float registers actually use twice the
space they require by including the padding into the component_size()
helper, right? (We also need to cover 8-bit by the way, but I suppose
it just means increasing the padding we use for 8-bit variables). And
then you modify is_partial_write() to use that helper so it accounts
for that padding and just check if the (padded) size is less than a
full SIMD register. Basically, we are making all smaller types look
like they use 32-bit per channel in the vgrf space.

One thing that worries me a bit is that we are faking the real size of
things and that can lead to parts of the compiler that rely on these
helpers to provide the real size to do things we don't want. For
example: are you limiting this to SIMD8 only? I ask because if we do
this for SIMD16 16-bit dispatches then we'd computing a size of two
registers for things that 

Re: [Mesa-dev] [PATCH 41/59] intel/compiler: split is_partial_write() into two variants

2018-12-11 Thread Pohjolainen, Topi
On Fri, Dec 07, 2018 at 03:30:11PM +0200, Pohjolainen, Topi wrote:
> On Tue, Dec 04, 2018 at 08:17:05AM +0100, Iago Toral Quiroga wrote:
> > This function is used in two different scenarios that for 32-bit
> > instructions are the same, but for 16-bit instructions are not.
> > 
> > One scenario is that in which we are working at a SIMD8 register
> > level and we need to know if a register is fully defined or written.
> > This is useful, for example, in the context of liveness analysis or
> > register allocation, where we work with units of registers.
> > 
> > The other scenario is that in which we want to know if an instruction
> > is writing a full scalar component or just some subset of it. This is
> > useful, for example, in the context of some optimization passes
> > like copy propagation.
> > 
> > For 32-bit instructions (or larger), a SIMD8 dispatch will always write
> > at least a full SIMD8 register (32B) if the write is not partial. The
> > function is_partial_write() checks this to determine if we have a partial
> > write. However, when we deal with 16-bit instructions, that logic disables
> > some optimizations that should be safe. For example, a SIMD8 16-bit MOV will
> > only update half of a SIMD register, but it is still a complete write of the
> > variable for a SIMD8 dispatch, so we should not prevent copy propagation in
> > this scenario because we don't write all 32 bytes in the SIMD register
> > or because the write starts at offset 16B (wehere we pack components Y or
> > W of 16-bit vectors).
> > 
> > This is a problem for SIMD8 executions (VS, TCS, TES, GS) of 16-bit
> > instructions, which lose a number of optimizations because of this, most
> > important of which is copy-propagation.
> > 
> > This patch splits is_partial_write() into is_partial_reg_write(), which
> > represents the current is_partial_write(), useful for things like
> > liveness analysis, and is_partial_var_write(), which considers
> > the dispatch size to check if we are writing a full variable (rather
> > than a full register) to decide if the write is partial or not, which
> > is what we really want in many optimization passes.

I actually started wondering why would liveness analysis and register
coalescing need to treat the 16-bit SIMD8 case differently than optimizations.
In virtual register space nothing would read or write the unused second half
of the register in case of 16-bit type and SIMD8.
Real register allocation in turn should be orthogonal to how things are
allocated in virtual space. And I guess even there we wouldn't be interested
of packing two 16-bit typed SIMD8 variables into one and same hardware
register. It is SIMD16 where we get more pressure into register space anyway
and there the 16-bit typed variables occupy full registers. In other words,
if things fit in SIMD16, would we bother packing things more tightly in
SIMD8? Or even if SIMD8 was the only option, would we be interested packing
channels for two variables in one hw reg even then?

Jason, we discussed this a little in the spring time.

As a recap my approach shortly. Instead of ignoring the second half of
registers case by case I addressed it more generally:

- changed all the open coded checks to use helpers,
- added a padding bit into fs_reg telling about the unused space,
- change nir -> fs step to set that bit for 16-bit typed regs
- and finally changed the helpers to consider the padding bit.

Now, I'm fine doing this case by case the way it is done here. I'm just
wondering if the split is needed, i.e., considering in some cases 16-bit SIMD8
virtual registers as half written and in some cases just ignoring the fact.

> > 
> > Then the patch goes on and rewrites all uses of is_partial_write() to use
> > one or the other version. Specifically, we use is_partial_var_write()
> > in the following places: copy propagation, cmod propagation, common
> > subexpression elimination, saturate propagation and sel peephole.
> > 
> > Notice that the semantics of is_partial_var_write() exactly match the
> > current implementation of is_partial_write() for anything that is
> > 32-bit or larger, so no changes are expected for 32-bit instructions.
> > 
> > Tested against ~5000 tests involving 16-bit instructions in CTS produced
> > the following changes in instruction counts:
> > 
> > Patched  | Master|%|
> > 
> > SIMD8  |621,900  |706,721| -12.00% |
> > 
> > SIMD16 | 93,252  | 93,252|   0.00% |
> > 
> > 
> > As expected, the change only affects SIMD8 dispatches.
> 
> I like this. But I think I want to try and rebase my fp16 work on top to see
> if there are any differences in the final assembly between this and my
> "register padding" scheme.
> 
> > ---
> >  src/intel/compiler/brw_fs.cpp | 31 +++
> >  

Re: [Mesa-dev] [PATCH 41/59] intel/compiler: split is_partial_write() into two variants

2018-12-07 Thread Pohjolainen, Topi
On Tue, Dec 04, 2018 at 08:17:05AM +0100, Iago Toral Quiroga wrote:
> This function is used in two different scenarios that for 32-bit
> instructions are the same, but for 16-bit instructions are not.
> 
> One scenario is that in which we are working at a SIMD8 register
> level and we need to know if a register is fully defined or written.
> This is useful, for example, in the context of liveness analysis or
> register allocation, where we work with units of registers.
> 
> The other scenario is that in which we want to know if an instruction
> is writing a full scalar component or just some subset of it. This is
> useful, for example, in the context of some optimization passes
> like copy propagation.
> 
> For 32-bit instructions (or larger), a SIMD8 dispatch will always write
> at least a full SIMD8 register (32B) if the write is not partial. The
> function is_partial_write() checks this to determine if we have a partial
> write. However, when we deal with 16-bit instructions, that logic disables
> some optimizations that should be safe. For example, a SIMD8 16-bit MOV will
> only update half of a SIMD register, but it is still a complete write of the
> variable for a SIMD8 dispatch, so we should not prevent copy propagation in
> this scenario because we don't write all 32 bytes in the SIMD register
> or because the write starts at offset 16B (wehere we pack components Y or
> W of 16-bit vectors).
> 
> This is a problem for SIMD8 executions (VS, TCS, TES, GS) of 16-bit
> instructions, which lose a number of optimizations because of this, most
> important of which is copy-propagation.
> 
> This patch splits is_partial_write() into is_partial_reg_write(), which
> represents the current is_partial_write(), useful for things like
> liveness analysis, and is_partial_var_write(), which considers
> the dispatch size to check if we are writing a full variable (rather
> than a full register) to decide if the write is partial or not, which
> is what we really want in many optimization passes.
> 
> Then the patch goes on and rewrites all uses of is_partial_write() to use
> one or the other version. Specifically, we use is_partial_var_write()
> in the following places: copy propagation, cmod propagation, common
> subexpression elimination, saturate propagation and sel peephole.
> 
> Notice that the semantics of is_partial_var_write() exactly match the
> current implementation of is_partial_write() for anything that is
> 32-bit or larger, so no changes are expected for 32-bit instructions.
> 
> Tested against ~5000 tests involving 16-bit instructions in CTS produced
> the following changes in instruction counts:
> 
> Patched  | Master|%|
> 
> SIMD8  |621,900  |706,721| -12.00% |
> 
> SIMD16 | 93,252  | 93,252|   0.00% |
> 
> 
> As expected, the change only affects SIMD8 dispatches.

I like this. But I think I want to try and rebase my fp16 work on top to see
if there are any differences in the final assembly between this and my
"register padding" scheme.

> ---
>  src/intel/compiler/brw_fs.cpp | 31 +++
>  .../compiler/brw_fs_cmod_propagation.cpp  | 20 ++--
>  .../compiler/brw_fs_copy_propagation.cpp  |  8 ++---
>  src/intel/compiler/brw_fs_cse.cpp |  3 +-
>  .../compiler/brw_fs_dead_code_eliminate.cpp   |  2 +-
>  src/intel/compiler/brw_fs_live_variables.cpp  |  2 +-
>  src/intel/compiler/brw_fs_reg_allocate.cpp|  2 +-
>  .../compiler/brw_fs_register_coalesce.cpp |  2 +-
>  .../compiler/brw_fs_saturate_propagation.cpp  |  7 +++--
>  src/intel/compiler/brw_fs_sel_peephole.cpp|  4 +--
>  src/intel/compiler/brw_ir_fs.h|  3 +-
>  11 files changed, 54 insertions(+), 30 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
> index 1d5d1dd0d22..9ea67975e1e 100644
> --- a/src/intel/compiler/brw_fs.cpp
> +++ b/src/intel/compiler/brw_fs.cpp
> @@ -698,14 +698,33 @@ fs_visitor::limit_dispatch_width(unsigned n, const char 
> *msg)
>   * it.
>   */
>  bool
> -fs_inst::is_partial_write() const
> +fs_inst::is_partial_reg_write() const
>  {
> return ((this->predicate && this->opcode != BRW_OPCODE_SEL) ||
> -   (this->exec_size * type_sz(this->dst.type)) < 32 ||
> !this->dst.is_contiguous() ||
> +   (this->exec_size * type_sz(this->dst.type)) < REG_SIZE ||
> this->dst.offset % REG_SIZE != 0);
>  }
>  
> +/**
> + * Returns true if the instruction has a flag that means it won't
> + * update an entire variable for the given dispatch width.
> + *
> + * This is only different from is_partial_reg_write() for SIMD8
> + * dispatches of 16-bit (or smaller) instructions.
> + */
> +bool
> +fs_inst::is_partial_var_write(uint32_t dispatch_width) const
> +{
> +   const uint32_t type_size 

[Mesa-dev] [PATCH 41/59] intel/compiler: split is_partial_write() into two variants

2018-12-03 Thread Iago Toral Quiroga
This function is used in two different scenarios that for 32-bit
instructions are the same, but for 16-bit instructions are not.

One scenario is that in which we are working at a SIMD8 register
level and we need to know if a register is fully defined or written.
This is useful, for example, in the context of liveness analysis or
register allocation, where we work with units of registers.

The other scenario is that in which we want to know if an instruction
is writing a full scalar component or just some subset of it. This is
useful, for example, in the context of some optimization passes
like copy propagation.

For 32-bit instructions (or larger), a SIMD8 dispatch will always write
at least a full SIMD8 register (32B) if the write is not partial. The
function is_partial_write() checks this to determine if we have a partial
write. However, when we deal with 16-bit instructions, that logic disables
some optimizations that should be safe. For example, a SIMD8 16-bit MOV will
only update half of a SIMD register, but it is still a complete write of the
variable for a SIMD8 dispatch, so we should not prevent copy propagation in
this scenario because we don't write all 32 bytes in the SIMD register
or because the write starts at offset 16B (wehere we pack components Y or
W of 16-bit vectors).

This is a problem for SIMD8 executions (VS, TCS, TES, GS) of 16-bit
instructions, which lose a number of optimizations because of this, most
important of which is copy-propagation.

This patch splits is_partial_write() into is_partial_reg_write(), which
represents the current is_partial_write(), useful for things like
liveness analysis, and is_partial_var_write(), which considers
the dispatch size to check if we are writing a full variable (rather
than a full register) to decide if the write is partial or not, which
is what we really want in many optimization passes.

Then the patch goes on and rewrites all uses of is_partial_write() to use
one or the other version. Specifically, we use is_partial_var_write()
in the following places: copy propagation, cmod propagation, common
subexpression elimination, saturate propagation and sel peephole.

Notice that the semantics of is_partial_var_write() exactly match the
current implementation of is_partial_write() for anything that is
32-bit or larger, so no changes are expected for 32-bit instructions.

Tested against ~5000 tests involving 16-bit instructions in CTS produced
the following changes in instruction counts:

Patched  | Master|%|

SIMD8  |621,900  |706,721| -12.00% |

SIMD16 | 93,252  | 93,252|   0.00% |


As expected, the change only affects SIMD8 dispatches.
---
 src/intel/compiler/brw_fs.cpp | 31 +++
 .../compiler/brw_fs_cmod_propagation.cpp  | 20 ++--
 .../compiler/brw_fs_copy_propagation.cpp  |  8 ++---
 src/intel/compiler/brw_fs_cse.cpp |  3 +-
 .../compiler/brw_fs_dead_code_eliminate.cpp   |  2 +-
 src/intel/compiler/brw_fs_live_variables.cpp  |  2 +-
 src/intel/compiler/brw_fs_reg_allocate.cpp|  2 +-
 .../compiler/brw_fs_register_coalesce.cpp |  2 +-
 .../compiler/brw_fs_saturate_propagation.cpp  |  7 +++--
 src/intel/compiler/brw_fs_sel_peephole.cpp|  4 +--
 src/intel/compiler/brw_ir_fs.h|  3 +-
 11 files changed, 54 insertions(+), 30 deletions(-)

diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
index 1d5d1dd0d22..9ea67975e1e 100644
--- a/src/intel/compiler/brw_fs.cpp
+++ b/src/intel/compiler/brw_fs.cpp
@@ -698,14 +698,33 @@ fs_visitor::limit_dispatch_width(unsigned n, const char 
*msg)
  * it.
  */
 bool
-fs_inst::is_partial_write() const
+fs_inst::is_partial_reg_write() const
 {
return ((this->predicate && this->opcode != BRW_OPCODE_SEL) ||
-   (this->exec_size * type_sz(this->dst.type)) < 32 ||
!this->dst.is_contiguous() ||
+   (this->exec_size * type_sz(this->dst.type)) < REG_SIZE ||
this->dst.offset % REG_SIZE != 0);
 }
 
+/**
+ * Returns true if the instruction has a flag that means it won't
+ * update an entire variable for the given dispatch width.
+ *
+ * This is only different from is_partial_reg_write() for SIMD8
+ * dispatches of 16-bit (or smaller) instructions.
+ */
+bool
+fs_inst::is_partial_var_write(uint32_t dispatch_width) const
+{
+   const uint32_t type_size = type_sz(this->dst.type);
+   uint32_t var_size = MIN2(REG_SIZE, dispatch_width * type_size);
+
+   return ((this->predicate && this->opcode != BRW_OPCODE_SEL) ||
+   !this->dst.is_contiguous() ||
+   (this->exec_size * type_sz(this->dst.type)) < var_size ||
+   this->dst.offset % var_size != 0);
+}
+
 unsigned
 fs_inst::components_read(unsigned i) const
 {
@@ -2847,7 +2866,7 @@ fs_visitor::opt_register_renaming()