[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-05-22 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

--- Comment #18 from Bill Schmidt  ---
I asked around a bit.  On x86, user-user attacks are not mitigated by default. 
To enable user-user mitigation:

echo 2 > /sys/kernel/debug/x86/ibrs_enabled

My source tells me:

8<---

Red Hat explains the above setting as follows in
https://access.redhat.com/articles/3311301 -

"When IBRS is set to 2 (spectre_v2=ibrs_always), both userland and kernel runs
with indirect branch restricted speculation. This protects userspace from
hyperthreading/simultaneous multi-threading attacks as well, and is also the
default on certain old AMD processors (family 10h, 12h and 16h). This feature
addresses CVE-2017-5715, variant #2."

If a GCC compiler with support for "thunks" is available, one might also build
applications such as PHP with the following flags added to mitigate Spectre
variant #2:
-mindirect-branch=thunk-inline -mfunction-return=thunk-inline
-mindirect-branch-register

However, it is possible that properly mitigating Spectre variant #2 on Skylake
processors requires both setting ibrs_enabled to 2 AND using thunks, although I
am not sure about this.

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-05-20 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

--- Comment #17 from Bill Schmidt  ---
OK, thanks!  I'd be very interested in hearing what you discover.

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-05-20 Thread tpearson at raptorengineering dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

--- Comment #16 from Timothy Pearson  ---
(In reply to Bill Schmidt from comment #15)
> PHP's reliance on frequent indirect branches makes it essentially the worst
> case for this sort of thing.  When Spectre v2 CVE mitigations are in place
> for user code, you will see performance issues on all architectures that
> rely on speculation for indirect branch performance.  When user code is
> running in an "unsafe" configuration, you will not see those issues.  (We
> have seen similar issues on x86 when retpoline is used for user code.)

What's most puzzling is that we're looking at benchmarks on x86 systems that
are supposed to be mitigated, but the performance drop isn't really showing up.
At this point I'm wondering if:

a.) The user/user attack isn't actually mitigated on these systems, only the
user/kernel attack

b.) Intel/AMD found some way to update the microcode so as not to have a heavy
performance loss

In any case, we'll continue to investigate / run benchmarks to see if any light
can be shed on this.

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-05-20 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

--- Comment #15 from Bill Schmidt  ---
PHP's reliance on frequent indirect branches makes it essentially the worst
case for this sort of thing.  When Spectre v2 CVE mitigations are in place for
user code, you will see performance issues on all architectures that rely on
speculation for indirect branch performance.  When user code is running in an
"unsafe" configuration, you will not see those issues.  (We have seen similar
issues on x86 when retpoline is used for user code.)

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-05-20 Thread tpearson at raptorengineering dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

--- Comment #14 from Timothy Pearson  ---
(In reply to Bill Schmidt from comment #13)
> This was prototyped and measured against the firmware fixes with
> indistinguishable results.  So the complexity of a software solution, with
> its impacts on Linux distributions, was not warranted.  (That is, the
> firmware workarounds are already tightly targeted.)

Good to know, thank you.  Was about to test it on this end.  I guess the main
takeaway then is that POWER9 handles interpreted workloads quite badly, or is
there still some possibility of additional optimization here?

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-05-20 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

--- Comment #13 from Bill Schmidt  ---
This was prototyped and measured against the firmware fixes with
indistinguishable results.  So the complexity of a software solution, with its
impacts on Linux distributions, was not warranted.  (That is, the firmware
workarounds are already tightly targeted.)

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-05-19 Thread tpearson at raptorengineering dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

--- Comment #12 from Timothy Pearson  ---
After quite a bit of investigation, this is down to the Spectre v2 user mode
protections on POWER9, which (from what I understand) involve completely
disabling the branch predictor.

My question then comes down to, why wasn't the retpoline style mitigation used
on ppc64el?  Nuking hardware elements seems extreme and is obviously causing
serious problems for indirect branching code (like that being seen here).

What would be involved in creating a retpoline-type mitigation for ppc64el, so
that we can run with the branch predictor turned back on (already verified to
fix much of the performance issues seen on POWER9)?

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-04-09 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

--- Comment #11 from Bill Schmidt  ---
(In reply to Timothy Pearson from comment #10)
> 
> It's even slow compared to P8 with mitigations applied.  Do you have a link
> to the hostboot commit that may have enabled the P9 mitigation, or to the
> register name (SCOM) that was modified to enable the mitigation?

No, I'm sorry, I don't know those details.  If you contact me offline I can
probably find someone who does.

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-04-05 Thread tpearson at raptorengineering dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

--- Comment #10 from Timothy Pearson  ---
(In reply to Bill Schmidt from comment #9)
> You mentioned you're on a POWER9 machine.  It could be that you have
> firmware with Spectre mitigations applied, which will affect all indirect
> branches.  It may be that you do not have Spectre mitigations applied on
> your x86 machine, in which case the comparison would be expected to be quite
> different.  Depending on firmware levels, the mitigations may be able to be
> switched off, so you should check into that first.  PHP is known to be
> sensitive to indirect branch performance.
> 
> The Power landing page for these mitigations is
> https://www.ibm.com/blogs/psirt/potential-impact-processors-power-family/. 
> From here you should be able to get to further information for your specific
> hardware and OS version.

It's even slow compared to P8 with mitigations applied.  Do you have a link to
the hostboot commit that may have enabled the P9 mitigation, or to the register
name (SCOM) that was modified to enable the mitigation?

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-04-05 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

--- Comment #9 from Bill Schmidt  ---
You mentioned you're on a POWER9 machine.  It could be that you have firmware
with Spectre mitigations applied, which will affect all indirect branches.  It
may be that you do not have Spectre mitigations applied on your x86 machine, in
which case the comparison would be expected to be quite different.  Depending
on firmware levels, the mitigations may be able to be switched off, so you
should check into that first.  PHP is known to be sensitive to indirect branch
performance.

The Power landing page for these mitigations is
https://www.ibm.com/blogs/psirt/potential-impact-processors-power-family/. 
From here you should be able to get to further information for your specific
hardware and OS version.
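
As a rough aid for the "check into that first" step above, here is a minimal C sketch that reads the Spectre v2 status line that recent Linux kernels expose under /sys/devices/system/cpu/vulnerabilities/spectre_v2. The helper names (read_first_line, print_spectre_v2_status) are made up for illustration. Older kernels may not have this file, and it reflects the kernel's view only, so it may not tell the whole story about firmware-level settings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: the running kernel exposes this sysfs entry. */
#define SPECTRE_V2_PATH "/sys/devices/system/cpu/vulnerabilities/spectre_v2"

/* Read the first line of `path` into `buf`, stripping the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
int read_first_line(const char *path, char *buf, size_t len)
{
    assert(path != NULL && buf != NULL && len > 0);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char *ok = fgets(buf, (int)len, f);
    fclose(f);
    if (ok == NULL)
        return -1;

    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Print the kernel-reported Spectre v2 status, if it is available. */
int print_spectre_v2_status(void)
{
    char line[256];

    if (read_first_line(SPECTRE_V2_PATH, line, sizeof line) != 0) {
        printf("spectre_v2 status not exposed by this kernel\n");
        return -1;
    }
    printf("spectre_v2: %s\n", line);
    return 0;
}
```

Calling print_spectre_v2_status() on both the POWER9 and the x86 machine before comparing benchmark numbers would at least show whether the two kernels report the same kind of mitigation.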

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-04-05 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

--- Comment #8 from Richard Biener  ---
(In reply to Timothy Pearson from comment #4)
> (In reply to Andrew Pinski from comment #3)
> > This is 100% the equivalent code.
> > 
> > jmp *(%r15) # opline.199_67->handler
> > Does two things:
> > loads a pointer from %r15 and then jumps to that pointer.
> > 
> > In PowerPC, you can only jump indirectly via the CTR or LR registers.
> > 
> > ld 9,0(29)   # opline.200_67->handler, gotovar.1505_2678
> > mtctr 9  # gotovar.1505_2678, gotovar.1505_2678
> > bctr
> > 
> > 
> > Most likely what is happening is the indirect branch predictor is not
> > predicting the branch correctly on the powerpc side while it is on the x86
> > side.  This is a micro-architecture difference between the two chips and is
> > unrelated to the ISA differences.
> 
> I'm forwarding this for analysis to see if there's anything we can do in
> firmware to "fix" the branch predictor.  If not, is there a way to prime the
> predictor in this scenario, or is this too specific to be added
> compiler-side?

The usual way is speculative devirtualization, you replace

  jmp *(%r15)

with

  if (*(%r15) == constant-address)
    jmp constant-address
  else
    jmp *(%r15)

where the hope is this helps branch prediction.

Other than that - are there very many such indirect branches or is it just
one?
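
For illustration, the guarded-dispatch idea above might look roughly like this at the C level. This is only a sketch: the handler names are hypothetical, and a real compiler would pick the "likely" target from profile data rather than have it hard-coded.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*handler_t)(long);

/* Hypothetical handlers; assume handle_common is by far the hottest target. */
long handle_common(long x) { return x + 1; }
long handle_rare(long x)   { return x * 2; }

/* Speculative devirtualization by hand: test for the expected target first,
 * so the common case becomes a conditional branch plus a direct call, and
 * only the uncommon case pays for the indirect branch. */
long dispatch_guarded(handler_t h, long x)
{
    assert(h != NULL);

    if (h == handle_common)
        return handle_common(x);  /* direct call, may also be inlined */

    return h(x);                  /* fallback: the original indirect call */
}
```

Whether this helps depends on how skewed the target distribution is. With many equally likely handlers, as in an interpreter's main dispatch, the guard mostly adds a mispredicted compare.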

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-04-04 Thread dje at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

--- Comment #7 from David Edelsohn  ---
One possibility is bad luck and the branch happens to fall on an address that
conflicts with another branch.

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-04-04 Thread tpearson at raptorengineering dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

--- Comment #6 from Timothy Pearson  ---
Understood.  I'll update this report if we find a way to get the predictor
working optimally in this scenario.

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-04-04 Thread dje at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

David Edelsohn  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|INVALID                     |---

--- Comment #5 from David Edelsohn  ---
The issue is *why* the branch predictor is not predicting it correctly. It may
be that the details of the branch predictor are causing the prediction to
conflict with another branch, for example, nullifying the correct prediction.
One should not leap to the conclusion that the predictor is not initialized.

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-04-04 Thread tpearson at raptorengineering dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

--- Comment #4 from Timothy Pearson  ---
(In reply to Andrew Pinski from comment #3)
> This is 100% the equivalent code.
> 
> jmp *(%r15) # opline.199_67->handler
> Does two things:
> loads a pointer from %r15 and then jumps to that pointer.
> 
> In PowerPC, you can only jump indirectly via the CTR or LR registers.
> 
> ld 9,0(29)   # opline.200_67->handler, gotovar.1505_2678
> mtctr 9  # gotovar.1505_2678, gotovar.1505_2678
> bctr
> 
> 
> Most likely what is happening is the indirect branch predictor is not
> predicting the branch correctly on the powerpc side while it is on the x86
> side.  This is a micro-architecture difference between the two chips and is
> unrelated to the ISA differences.

I'm forwarding this for analysis to see if there's anything we can do in
firmware to "fix" the branch predictor.  If not, is there a way to prime the
predictor in this scenario, or is this too specific to be added compiler-side?

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-04-04 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |RESOLVED
   Last reconfirmed|2018-04-05 00:00:00         |
         Resolution|---                         |INVALID

--- Comment #3 from Andrew Pinski  ---
This is 100% the equivalent code.

jmp *(%r15) # opline.199_67->handler
Does two things:
loads a pointer from %r15 and then jumps to that pointer.

In PowerPC, you can only jump indirectly via the CTR or LR registers.

ld 9,0(29)   # opline.200_67->handler, gotovar.1505_2678
mtctr 9  # gotovar.1505_2678, gotovar.1505_2678
bctr


Most likely what is happening is the indirect branch predictor is not
predicting the branch correctly on the powerpc side while it is on the x86
side.  This is a micro-architecture difference between the two chips and is
unrelated to the ISA differences.

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-04-04 Thread tpearson at raptorengineering dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

--- Comment #2 from Timothy Pearson  ---
(In reply to David Edelsohn from comment #1)
> What two additional instructions?  x86 is a CISC architecture and Power is a
> RISC architecture.  x86 has an instruction that directly performs an
> indirect call through a pointer. Power must explicitly load the pointer and
> move it to the appropriate register to perform an indirect branch.
> 
> One can comment on or question the fact that the *SEQUENCE* appears to
> require more time on Power than the equivalent sequence on x86. But directly
> comparing instructions and counting instructions in two different ISAs
> without context is not meaningful.

That is in fact what I am concerned with, the fact that the sequence is taking
longer than the equivalent sequence on x86.  I am aware that the two
instruction sequences accomplish the same goal, but for some reason the x86 one
is fast enough that it doesn't even show up in the perf output as a hot
instruction, while the ppc64 sequence stalls twice (two hot instructions), once
on the load and once on the register move.

[Bug target/85216] Performance issue with PHP on ppc64 systems

2018-04-04 Thread dje at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85216

David Edelsohn  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |WAITING
   Last reconfirmed|                            |2018-04-05
     Ever confirmed|0                           |1

--- Comment #1 from David Edelsohn  ---
What two additional instructions?  x86 is a CISC architecture and Power is a
RISC architecture.  x86 has an instruction that directly performs an indirect
call through a pointer. Power must explicitly load the pointer and move it to
the appropriate register to perform an indirect branch.

One can comment on or question the fact that the *SEQUENCE* appears to require
more time on Power than the equivalent sequence on x86. But directly comparing
instructions and counting instructions in two different ISAs without context is
not meaningful.