Re: [m5-dev] Condition code bits in X86 O3

2011-02-13 Thread Steve Reinhardt
Hi Gabe,

I just got around to reading this... please fill me in with more design
details as you work on this, as I'd like to keep on top of what you're doing
and (perhaps) be in a position to offer some suggestions.

Thanks,

Steve

On Fri, Feb 11, 2011 at 4:16 PM, Gabriel Michael Black 
gbl...@eecs.umich.edu wrote:

 Hello again. I've had a chance to talk with an expert, and I have an idea
 of how to approach this. It's going to require more flexibility than the ISA
 parser currently has, though, specifically in how the lists of source and
 destination registers are managed. It would also be nice to have a more
 integrated idea of composite operands, i.e. ones where some bits come from
 here, some from there, and in the end it builds a single uint64_t, double
 precision float, vector of uint32_ts, etc.

 Rather than try to shoehorn this into a system that's already suffered
 enough of my abuse, aka the ISA description language, I'm going to attempt
 to build a parallel facility for defining instructions usable from inside
 the Python in let blocks. Basically it would be Python classes, functions,
 etc. (hopefully not that many) exported into the let block context that
 would allow more direct interaction with the parser's guts, and more control
 over how things are put together.

 In the future I'd like to see this bud into isa_parser2.py, but that's
 going to be a lot of work and is a somewhat orthogonal issue. Ideally this
 sort of thing will also make it easier to split output into smaller files.

 Gabe


 Quoting Gabe Black gbl...@eecs.umich.edu:

 I'm looking at why x86 goes so much slower than Alpha on O3 (4x the
 ticks), and I think one culprit is the dependencies set up by the condition
 code bits of the flags register. Many instructions in x86 modify or
 depend on those bits, and even though the condition codes are separated
 out from the flags register (which does a lot of other stuff too),
 they're being updated with a read-modify-write sort of mechanism. I
 expect that's setting up long chains of serializing dependencies which
 are killing parallelism and performance.

 Basically, there are 6 condition codes in x86: Z, C, A, S, P, and O, or zero,
 carry, auxiliary carry, sign, parity and overflow. In M5's
 implementation (and in the patent I patterned it after) there are also
 artificial emulation zero and carry flags that work like the regular
 ones but are maintained separately. They can be updated independently
 and checked separately, and are useful behind the scenes when
 implementing some macroops. Instructions may update all of these flags
 or only some of them. The PTLSim manual claims that there's a ZAPS
 rule where the zero, auxiliary carry, parity and sign bits are always
 updated together. That's usually true, but certain instructions change
 only the zero flag. CMPXCHG8B is an example.

 What I'd been thinking of doing to handle this is to further split up
 the condition code bits into separate registers to be managed
 independently for any register renaming. There are a couple of issues
 with that, though. First, it looks like there'd have to be 6 different
 registers, APS, Z, O, C, EZ, and EC. A non-trivial number of
 instructions would need to update 4 or more of those, putting a perhaps
 unrealistic burden on any rename mechanism. That would also make the
 simple CPUs slower because they'd have to read/write all those extra
 registers. Bread and butter x86 tends to be condition code happy, so
 that could be a significant slowdown.

 Also, that complicates decoding significantly. Conceptually it's easy to
 imagine reading/writing the registers with the bits you need, but with
 the ISA parser, the code needs to either be there or not be there. If
 you have code that's never used but accesses a register, it'll still get
 pulled in as a source or dest. That means there would need to be a hard
 coded version of every microop that would correspond to each possible
 combination of condition code bits. Since there are 6 bits, that's 2^6
 combinations, times 2 variants for partial or complete register writes, so
 2^7 or 128 versions of every microop. There are also register/immediate
 versions of many microops. We would likely end up with thousands of microop
 classes.
 We'd also need to generate selection functions that would pick which
 variant to use. This is all possible, but fairly ugly and clunky.

 So does anybody have any suggestions on how to unserialize these
 microops? I found a paper here:
 http://www.wseas.us/e-library/conferences/2006elounda1/papers/537-325.pdf
 that claims IPC for x86 CPUs is significantly worse than other ISAs
 specifically because of this sort of thing. Is this just a fact of life
 with x86? Would fixing it be not only very annoying but also
 unrealistic? Is that paper's claim actually true?

 Gabe
 ___
 m5-dev mailing list
 m5-dev@m5sim.org
 http://m5sim.org/mailman/listinfo/m5-dev



 

Re: [m5-dev] Condition code bits in X86 O3

2011-02-11 Thread Gabriel Michael Black
Hello again. I've had a chance to talk with an expert, and I have an
idea of how to approach this. It's going to require more flexibility
than the ISA parser currently has, though, specifically in how the
lists of source and destination registers are managed. It would also be
nice to have a more integrated idea of composite operands, i.e. ones
where some bits come from here, some from there, and in the end it
builds a single uint64_t, double precision float, vector of uint32_ts,
etc.
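
To make the "composite operand" idea concrete, here is a minimal Python sketch of one value assembled from bits held in several places. It uses the condition-code grouping (APS, Z, O, C) from the quoted message below and the architectural x86 flag bit positions. The function name `compose_cc` and the assumption that each separate register keeps its bits in their architectural positions are illustrative only, not taken from M5.

```python
# Architectural bit positions of the x86 condition codes in the flags register.
CF_BIT, PF_BIT, AF_BIT, ZF_BIT, SF_BIT, OF_BIT = 0, 2, 4, 6, 7, 11

APS_MASK = (1 << AF_BIT) | (1 << PF_BIT) | (1 << SF_BIT)
Z_MASK = 1 << ZF_BIT
O_MASK = 1 << OF_BIT
C_MASK = 1 << CF_BIT


def compose_cc(aps, z, o, c):
    """Build one condition-code value from four separately held registers.

    Assumption (hypothetical layout): each register stores its flags at
    their architectural bit positions, so composing is mask-and-OR.
    """
    return (aps & APS_MASK) | (z & Z_MASK) | (o & O_MASK) | (c & C_MASK)


# Example: Z, O and C set, A/P/S clear.
print(hex(compose_cc(0x0, 0x40, 0x800, 0x1)))
```

The point of the sketch is only that the operand's consumer sees a single integer while the sources stay independent registers.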


Rather than try to shoehorn this into a system that's already
suffered enough of my abuse, aka the ISA description language, I'm
going to attempt to build a parallel facility for defining
instructions usable from inside the Python in let blocks. Basically
it would be Python classes, functions, etc. (hopefully not that many)
exported into the let block context that would allow more direct
interaction with the parser's guts, and more control over how things
are put together.


In the future I'd like to see this bud into isa_parser2.py, but that's  
going to be a lot of work and is a somewhat orthogonal issue. Ideally  
this sort of thing will also make it easier to split output into  
smaller files.


Gabe

Quoting Gabe Black gbl...@eecs.umich.edu:


I'm looking at why x86 goes so much slower than Alpha on O3 (4x the
ticks), and I think one culprit is the dependencies set up by the condition
code bits of the flags register. Many instructions in x86 modify or
depend on those bits, and even though the condition codes are separated
out from the flags register (which does a lot of other stuff too),
they're being updated with a read-modify-write sort of mechanism. I
expect that's setting up long chains of serializing dependencies which
are killing parallelism and performance.
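
The suspected effect can be illustrated with a toy dependency model. This is a sketch under simplifying assumptions (unit latency, ideal renaming, only read-after-write dependencies), not a model of the O3 CPU; the register names and the `critical_path` helper are hypothetical.

```python
def critical_path(instrs):
    """Length of the longest read-after-write chain.

    Each instruction is a (sources, destinations) pair of register-name
    sets. Assumes unit latency and ideal renaming, so only true
    dependencies matter.
    """
    ready = {}  # register -> depth at which its newest value is produced
    longest = 0
    for srcs, dests in instrs:
        depth = 1 + max((ready.get(r, 0) for r in srcs), default=0)
        for r in dests:
            ready[r] = depth
        longest = max(longest, depth)
    return longest


# Four otherwise independent adds. With read-modify-write flags, the
# condition-code register "cc" is both a source and a destination.
rmw = [({f"a{i}", f"b{i}", "cc"}, {f"a{i}", "cc"}) for i in range(4)]

# Same adds, but each one fully overwrites "cc" without reading it.
write_only = [({f"a{i}", f"b{i}"}, {f"a{i}", "cc"}) for i in range(4)]

print(critical_path(rmw))         # 4: the adds are serialized through cc
print(critical_path(write_only))  # 1: the adds are independent
```

In this model the read-modify-write update alone turns four independent instructions into a chain of length four.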

Basically, there are 6 condition codes in x86: Z, C, A, S, P, and O, or zero,
carry, auxiliary carry, sign, parity and overflow. In M5's
implementation (and in the patent I patterned it after) there are also
artificial emulation zero and carry flags that work like the regular
ones but are maintained separately. They can be updated independently
and checked separately, and are useful behind the scenes when
implementing some macroops. Instructions may update all of these flags
or only some of them. The PTLSim manual claims that there's a ZAPS
rule where the zero, auxiliary carry, parity and sign bits are always
updated together. That's usually true, but certain instructions change
only the zero flag. CMPXCHG8B is an example.

What I'd been thinking of doing to handle this is to further split up
the condition code bits into separate registers to be managed
independently for any register renaming. There are a couple of issues
with that, though. First, it looks like there'd have to be 6 different
registers, APS, Z, O, C, EZ, and EC. A non-trivial number of
instructions would need to update 4 or more of those, putting a perhaps
unrealistic burden on any rename mechanism. That would also make the
simple CPUs slower because they'd have to read/write all those extra
registers. Bread and butter x86 tends to be condition code happy, so
that could be a significant slowdown.
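
As a rough illustration of the rename burden, the sketch below maps each flag to the register that would hold it under the six-way split described above and counts how many registers an instruction would have to write. The mapping table and the helper name `regs_written` are assumptions for illustration; only the grouping itself comes from the message.

```python
# Hypothetical mapping from each flag to the register holding it under
# the proposed split (APS, Z, O, C, EZ, EC).
FLAG_TO_REG = {
    "A": "APS", "P": "APS", "S": "APS",
    "Z": "Z", "O": "O", "C": "C",
    "EZ": "EZ", "EC": "EC",
}


def regs_written(flags):
    """Return the sorted set of split registers covering the given flags."""
    return sorted({FLAG_TO_REG[f] for f in flags})


# An instruction such as ADD updates all six architectural flags.
print(regs_written(["Z", "C", "A", "S", "P", "O"]))  # ['APS', 'C', 'O', 'Z']

# CMPXCHG8B changes only the zero flag.
print(regs_written(["Z"]))  # ['Z']
```

An ordinary arithmetic instruction therefore needs four extra destination registers renamed, on top of its integer result.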

Also, that complicates decoding significantly. Conceptually it's easy to
imagine reading/writing the registers with the bits you need, but with
the ISA parser, the code needs to either be there or not be there. If
you have code that's never used but accesses a register, it'll still get
pulled in as a source or dest. That means there would need to be a hard
coded version of every microop that would correspond to each possible
combination of condition code bits. Since there are 6 bits, that's 2^6
combinations, times 2 variants for partial or complete register writes, so
2^7 or 128 versions of every microop. There are also register/immediate
versions of many microops. We would likely end up with thousands of microop
classes.
We'd also need to generate selection functions that would pick which
variant to use. This is all possible, but fairly ugly and clunky.
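
The variant count can be checked by enumerating the combinations directly. This is a small sketch of the arithmetic only; the names `FLAG_REGS`, `WRITE_KINDS` and `microop_variants` are hypothetical and nothing here reflects how the ISA parser would actually generate classes.

```python
from itertools import product

FLAG_REGS = ["APS", "Z", "O", "C", "EZ", "EC"]
WRITE_KINDS = ["partial", "complete"]


def microop_variants(flag_regs=FLAG_REGS, write_kinds=WRITE_KINDS):
    """Enumerate every (flag-register subset, write kind) pair."""
    variants = []
    for mask in product([False, True], repeat=len(flag_regs)):
        written = tuple(r for r, used in zip(flag_regs, mask) if used)
        for kind in write_kinds:
            variants.append((written, kind))
    return variants


variants = microop_variants()
print(len(variants))      # 128 = 2**6 * 2
print(len(variants) * 2)  # 256 once register/immediate forms are doubled
```

With 256 classes for each microop that has both register and immediate forms, a few dozen such microops are enough to reach the thousands mentioned above.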

So does anybody have any suggestions on how to unserialize these
microops? I found a paper here:
http://www.wseas.us/e-library/conferences/2006elounda1/papers/537-325.pdf
that claims IPC for x86 CPUs is significantly worse than other ISAs
specifically because of this sort of thing. Is this just a fact of life
with x86? Would fixing it be not only very annoying but also
unrealistic? Is that paper's claim actually true?

Gabe



