[gem5-dev] Re: overview/documentation/tests for vector register related stuff?

2021-08-09 Thread Gabe Black via gem5-dev
On Mon, Aug 9, 2021 at 1:02 PM Giacomo Travaglini <
giacomo.travagl...@arm.com> wrote:

> Hi Gabe,
>
> > -Original Message-
> > From: Gabe Black via gem5-dev 
> > Sent: 09 August 2021 11:02
> > To: gem5 Developer List 
> > Cc: Gabe Black 
> > Subject: [gem5-dev] Re: overview/documentation/tests for vector register
> > related stuff?
> >
> > I've done a bit of digging so far, and I think I've figured out a bit
> > about the rename mode.
> >
> > 1. This is only used by ARM to handle the difference in how registers are
> > renamed in aarch64 vs otherwise.
> > 2. This is handled in O3 by detecting a squash in the CPU and then
> > checking the aarch64 bit of the PCState.
> > 3. If this changes, then O3 potentially shuffles things around to make
> > register chunks contiguous, and starts renaming things differently.
> > 4. The only way to switch in or out of aarch64 is through a fault.
>
> Yes, just to be more precise, it is happening by issuing a fault or by
> returning from a fault (this is just to make it clear
> the switch can happen with a non faulting instruction like an ERET)
>
> >
> > This leads me to a few conclusions.
> >
> > 1. Having the aarch64 bit in the PCState structure is probably not
> > necessary and may actually be harmful because it makes that structure
> > larger and slower to move around. This value does *not* change quickly or
> > frequently, and only changes as part of an already heavy mode switch. It
> > does not need to be predicted/predictable like a next PC, like something
> > like thumb mode might.
>
> You can figure out the execution mode (AArch64/AArch32) in a different way
> by inspecting the PSTATE. So I can see the redundancy. However, inspecting
> the PSTATE/CPSR from the TC is probably not going to be faster. We need to
> know the aarch64 state in the decoder, so I guess we could cache it in there.
>
> In any case IMO I don't think removing it from the PCState is gonna affect
> simulation time in any measurable way.
>

Yeah, I think this is mostly orthogonal, I just wanted to bring it up since
I noticed it.


>
> > 2. The O3 CPU is checking renaming mode *way* more often than it really
> > needs to. Almost every single squash is *not* a switch to/from 64 bit
> > mode, but *every* switch involves that check, even in ISAs that don't
> > even *have* rename modes.
> > 3. The rename semantics switch can be handled right in the fault object
> > when it implements the faulting context switch. It can detect that a
> > switch is necessary and enact it without all the extra checks.
>
> Totally agree on point 2. About point 3, yes you could handle it in the
> fault object and in the ERET instruction.
> That would mean leaking uarch code into the arch directory. In other words,
> having some *O3 specific* code in the arch directory. This is not ideal
> IMHO as it is binding the arch code to a single CPU model.
>

Please see my CL here:
https://gem5-review.googlesource.com/c/public/gem5/+/49147

I don't think it brings uarch code into the ISA implementation. What it
does is reestablish the invariant that registers are atomic blobs which
have no structure to the CPU, and then builds the different indexing views
into the ARM implementation instead. This is the way it used to work where
an ISA would compose composite registers by reading in their parts. I would
say this actually brings *less* uarch implementation into the ISA than
before, since now the ISA doesn't need to worry about rename modes, or that
there is even a rename step. It just has to maintain the invariant that
registers are atomic blobs as far as the CPU is concerned, and build
whatever other semantics it needs on top of that.
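
To make the "atomic blobs" invariant concrete, here is a minimal sketch. It is not the code from the CL above; every name in it (BlobRegFile, VecBlob, readElem, writeElem, the 16-byte register size) is made up for illustration. The CPU-side register file only moves whole registers, and the ISA side composes an element view from whole-register reads and writes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical register size, chosen only for illustration.
constexpr std::size_t VecBytes = 16;
using VecBlob = std::array<std::uint8_t, VecBytes>;

// CPU side: registers are opaque blobs. There is no element indexing and
// no notion of a rename mode here.
class BlobRegFile
{
  public:
    explicit BlobRegFile(std::size_t numRegs) : regs(numRegs) {}

    const VecBlob &
    read(std::size_t idx) const
    {
        assert(idx < regs.size());
        return regs[idx];
    }

    void
    write(std::size_t idx, const VecBlob &val)
    {
        assert(idx < regs.size());
        regs[idx] = val;
    }

  private:
    std::vector<VecBlob> regs; // value-initialized, so all zeroes
};

// ISA side: an element-indexed view built on top of whole-register accesses.
inline std::uint32_t
readElem(const BlobRegFile &rf, std::size_t reg, std::size_t lane)
{
    assert(lane < VecBytes / sizeof(std::uint32_t));
    std::uint32_t val;
    std::memcpy(&val, rf.read(reg).data() + lane * sizeof(val), sizeof(val));
    return val;
}

inline void
writeElem(BlobRegFile &rf, std::size_t reg, std::size_t lane,
          std::uint32_t val)
{
    assert(lane < VecBytes / sizeof(std::uint32_t));
    VecBlob blob = rf.read(reg);  // read the whole register
    std::memcpy(blob.data() + lane * sizeof(val), &val, sizeof(val));
    rf.write(reg, blob);          // write the whole register back
}
```

The cost of this arrangement is that an element write becomes a read-modify-write of the full register, which is the trade the paragraph above accepts in exchange for keeping the CPU unaware of register structure.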


>
> > 4. ARM can implement SVE, etc, using two different register files, one
> > which is indexed by element for 32 bit mode, and one which is indexed by
> > vector for 64 bit mode. The mode switch can copy values between the
> > register files, and we can remove what I suspect is a lot of machinery from
> > O3 by just letting it manage two different register files simply, instead
> > of managing one with two different personalities. This also makes the
> > register files much more homogenous and easier to generalize. A "real" CPU
> > may not want to waste transistors, buses, etc, for two separate register
> > files, but in the end it doesn't matter if the behavior is the same. This
> > is all just in how O3 does its bookkeeping, and a redundant register file
> > is nearly free for gem5.
> >
>
> I would love to see a cleaner implementation! But 

[gem5-dev] Re: overview/documentation/tests for vector register related stuff?

2021-08-09 Thread Giacomo Travaglini via gem5-dev
Hi Gabe,

> -Original Message-
> From: Gabe Black via gem5-dev 
> Sent: 09 August 2021 11:02
> To: gem5 Developer List 
> Cc: Gabe Black 
> Subject: [gem5-dev] Re: overview/documentation/tests for vector register
> related stuff?
>
> I've done a bit of digging so far, and I think I've figured out a bit about
> the rename mode.
>
> 1. This is only used by ARM to handle the difference in how registers are
> renamed in aarch64 vs otherwise.
> 2. This is handled in O3 by detecting a squash in the CPU and then checking
> the aarch64 bit of the PCState.
> 3. If this changes, then O3 potentially shuffles things around to make
> register chunks contiguous, and starts renaming things differently.
> 4. The only way to switch in or out of aarch64 is through a fault.

Yes, just to be more precise, it happens either by taking a fault or by
returning from one (just to make it clear that the switch can also happen
with a non-faulting instruction like an ERET).

>
> This leads me to a few conclusions.
>
> 1. Having the aarch64 bit in the PCState structure is probably not necessary
> and may actually be harmful because it makes that structure larger and
> slower to move around. This value does *not* change quickly or frequently,
> and only changes as part of an already heavy mode switch. It does not need
> to be predicted/predictable like a next PC, like something like thumb mode
> might.

You can figure out the execution mode (AArch64/AArch32) in a different way by
inspecting the PSTATE. So I can see the redundancy. However, inspecting the
PSTATE/CPSR from the TC is probably not going to be faster. We need to know
the aarch64 state in the decoder, so I guess we could cache it in there.

In any case IMO I don't think removing it from the PCState is gonna affect
simulation time in any measurable way.

> 2. The O3 CPU is checking renaming mode *way* more often than it really
> needs to. Almost every single squash is *not* a switch to/from 64 bit mode,
> but *every* switch involves that check, even in ISAs that don't even *have*
> rename modes.
> 3. The rename semantics switch can be handled right in the fault object when
> it implements the faulting context switch. It can detect that a switch is
> necessary and enact it without all the extra checks.

Totally agree on point 2. About point 3, yes you could handle it in the fault
object and in the ERET instruction.
That would mean leaking uarch code into the arch directory. In other words,
having some *O3 specific* code in the arch directory. This is not ideal IMHO
as it is binding the arch code to a single CPU model.

> 4. ARM can implement SVE, etc, using two different register files, one which
> is indexed by element for 32 bit mode, and one which is indexed by vector
> for 64 bit mode. The mode switch can copy values between the register files,
> and we can remove what I suspect is a lot of machinery from O3 by just
> letting it manage two different register files simply, instead of managing one
> with two different personalities. This also makes the register files much more
> homogenous and easier to generalize. A "real" CPU may not want to waste
> transistors, buses, etc, for two separate register files, but in the end it
> doesn't matter if the behavior is the same. This is all just in how O3 does
> its bookkeeping, and a redundant register file is nearly free for gem5.
>

I would love to see a cleaner implementation! But I am not entirely sure your
solution is much different from what we have now:
Sure, there is only one storage [1], but all the remaining data structures are
duplicated (check vecRegIds and vecElemIds as an example, or the vecElem/vecReg
freeLists [2]).
In fact, we are already copying values from one register file to the other when
switching from Rename::Full to Rename::Elem [3].
I honestly believe having two different regfiles is the source of all our
problems, as it forces us to switch/copy values when a change in rename mode
happens. What the implementation should have been is one single set of vector
data structures with two different views.
No synchronization needed; AArch32 uses the Elem view and AArch64 uses the Full
view.
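
As a rough illustration of "one set of data structures with two views", here is a minimal sketch. It is not gem5 code: VecStorage, ElemsPerVec and the accessor names are hypothetical, and it covers only the storage, not the rename map or the free lists. Both views index the same flat array, so a value written through one view is immediately visible through the other and nothing has to be copied on a mode change.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical number of 32-bit elements per vector register.
constexpr std::size_t ElemsPerVec = 4;

class VecStorage
{
  public:
    explicit VecStorage(std::size_t numVecs)
        : elems(numVecs * ElemsPerVec, 0)
    {}

    // Elem view (AArch32 in this sketch): one flat element index.
    std::uint32_t &
    elem(std::size_t elemIdx)
    {
        assert(elemIdx < elems.size());
        return elems[elemIdx];
    }

    // Full view (AArch64 in this sketch): pointer to the first of the
    // ElemsPerVec elements that make up vector vecIdx.
    std::uint32_t *
    full(std::size_t vecIdx)
    {
        assert(vecIdx < elems.size() / ElemsPerVec);
        return &elems[vecIdx * ElemsPerVec];
    }

    // Translate a (vector, lane) pair into the flat element index.
    static std::size_t
    elemIndex(std::size_t vecIdx, std::size_t lane)
    {
        assert(lane < ElemsPerVec);
        return vecIdx * ElemsPerVec + lane;
    }

  private:
    std::vector<std::uint32_t> elems; // the single backing store
};
```

The hard part in a real O3 model would be doing the same for the renaming structures, which this sketch deliberately leaves out.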

> Please let me know if this is correct, and I'll start chopping away. Some way
> to test my changes would be very helpful, since otherwise I'll just be hoping
> for the best :-P.

I would recommend cross-compiling an FP application for AArch32 and
executing it on an AArch64 Linux kernel (with syscalls, to make sure
we change rename mode and don't rely on the intervention of the scheduler).
You could even cross-compile the same source for AArch64, execute it as a
separate process, and of course multiplex the two on the same CPU.
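
A workload along these lines could be as small as the sketch below. The function name fpSyscallLoop and the choice of getpid as the syscall are arbitrary assumptions; the only point is to keep a floating-point value live across repeated trips into the kernel. Whether the compiler keeps the accumulator in an FP register depends on the toolchain and optimization level, so it is worth checking the disassembly.

```cpp
#include <cassert>
#include <sys/syscall.h>
#include <unistd.h>

// Accumulates 0.5 * i for i in [0, iterations), entering the kernel once
// per iteration so that FP state has to survive each user/kernel switch.
double
fpSyscallLoop(int iterations)
{
    assert(iterations >= 0);
    double acc = 0.0;
    for (int i = 0; i < iterations; ++i) {
        acc += 0.5 * i;
        // Raw syscall to force a real kernel entry on every iteration.
        long pid = syscall(SYS_getpid);
        assert(pid > 0);
        (void)pid; // silence the unused warning when NDEBUG is defined
    }
    return acc;
}
```

Call it from a trivial main, print the result, and compare the output of the AArch32 and AArch64 builds (for example built statically with arm-linux-gnueabihf-g++ and aarch64-linux-gnu-g++) against a native run; any mismatch points at register state being lost or mangled across a mode switch.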

>
> Gabe

Kind Regards

Giacomo

[1]: https://github.com/gem5/gem5/blob/stable/src/cpu/o3/regfile.hh

[gem5-dev] Re: overview/documentation/tests for vector register related stuff?

2021-08-09 Thread Gabe Black via gem5-dev
I've done a bit of digging so far, and I think I've figured out a bit about
the rename mode.

1. This is only used by ARM to handle the difference in how registers are
renamed in aarch64 vs otherwise.
2. This is handled in O3 by detecting a squash in the CPU and then checking
the aarch64 bit of the PCState.
3. If this changes, then O3 potentially shuffles things around to make
register chunks contiguous, and starts renaming things differently.
4. The only way to switch in or out of aarch64 is through a fault.
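
The behaviour described in points 2 to 4 can be sketched as below. This is a simplified model and not the actual O3 code: CpuSketch, PCStateSketch, onSquash and onModeSwitch are all hypothetical names. The counters show the mode check running on every squash even though the mode rarely changes, next to an explicit switch hook that only the fault/return path would call.

```cpp
#include <cassert>
#include <cstdint>

enum class RenameMode { Elem, Full };

// Hypothetical stand-in for the PCState, carrying the aarch64 bit.
struct PCStateSketch
{
    std::uint64_t pc;
    bool aarch64;
};

class CpuSketch
{
  public:
    unsigned modeChecks = 0;   // how often the mode was examined
    unsigned modeSwitches = 0; // how often it actually changed

    // Current scheme as described above: every squash re-derives the
    // rename mode from the aarch64 bit of the PCState.
    void
    onSquash(const PCStateSketch &pc)
    {
        ++modeChecks;
        setMode(pc.aarch64 ? RenameMode::Full : RenameMode::Elem);
    }

    // Alternative: the fault (or ERET) path requests the switch directly,
    // so ordinary squashes never need to look at the mode.
    void
    onModeSwitch(bool toAarch64)
    {
        setMode(toAarch64 ? RenameMode::Full : RenameMode::Elem);
    }

    RenameMode currentMode() const { return mode; }

  private:
    void
    setMode(RenameMode wanted)
    {
        if (wanted != mode) {
            // A real model would reshuffle registers here.
            mode = wanted;
            ++modeSwitches;
        }
        assert(mode == wanted);
    }

    RenameMode mode = RenameMode::Full;
};
```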

This leads me to a few conclusions.

1. Having the aarch64 bit in the PCState structure is probably not
necessary and may actually be harmful because it makes that structure
larger and slower to move around. This value does *not* change quickly or
frequently, and only changes as part of an already heavy mode switch. It
does not need to be predicted/predictable like a next PC, like something
like thumb mode might.
2. The O3 CPU is checking renaming mode *way* more often than it really
needs to. Almost every single squash is *not* a switch to/from 64 bit mode,
but *every* switch involves that check, even in ISAs that don't even *have*
rename modes.
3. The rename semantics switch can be handled right in the fault object
when it implements the faulting context switch. It can detect that a switch
is necessary and enact it without all the extra checks.
4. ARM can implement SVE, etc, using two different register files, one
which is indexed by element for 32 bit mode, and one which is indexed by
vector for 64 bit mode. The mode switch can copy values between the
register files, and we can remove what I suspect is a lot of machinery from
O3 by just letting it manage two different register files simply, instead
of managing one with two different personalities. This also makes the
register files much more homogenous and easier to generalize. A "real" CPU
may not want to waste transistors, buses, etc, for two separate register
files, but in the end it doesn't matter if the behavior is the same. This
is all just in how O3 does its bookkeeping, and a redundant register file
is nearly free for gem5.
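
Conclusion 4 could look roughly like the following. This is only a sketch of the idea under made-up names and sizes (DualVecRegs, ElemsPerVec, switchMode), not a proposal for the real O3 interfaces: two plain register files, one indexed by vector and one by element, with the mode switch copying the architectural contents across.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical number of 32-bit elements per vector register.
constexpr std::size_t ElemsPerVec = 4;
using Vec = std::array<std::uint32_t, ElemsPerVec>;

enum class VecMode { Elem, Full };

struct DualVecRegs
{
    std::vector<Vec> fullFile;           // indexed by vector (64 bit mode)
    std::vector<std::uint32_t> elemFile; // indexed by element (32 bit mode)
    VecMode mode = VecMode::Full;

    explicit DualVecRegs(std::size_t numVecs)
        : fullFile(numVecs), elemFile(numVecs * ElemsPerVec, 0)
    {}

    // Copy values from the file of the old mode into the file of the new
    // one. Only meant to run on the (rare, already heavy) mode switch.
    void
    switchMode(VecMode newMode)
    {
        if (newMode == mode)
            return;
        assert(elemFile.size() == fullFile.size() * ElemsPerVec);
        for (std::size_t v = 0; v < fullFile.size(); ++v) {
            for (std::size_t l = 0; l < ElemsPerVec; ++l) {
                if (newMode == VecMode::Elem)
                    elemFile[v * ElemsPerVec + l] = fullFile[v][l];
                else
                    fullFile[v][l] = elemFile[v * ElemsPerVec + l];
            }
        }
        mode = newMode;
    }
};
```

Each file is homogeneous on its own, so the CPU can manage both the same way it manages any other register file; the price is the copy loop on every switch.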

Please let me know if this is correct, and I'll start chopping away. Some
way to test my changes would be very helpful, since otherwise I'll just be
hoping for the best :-P.

Gabe