Fwd: optimizing acmp in L-World

John Rose Sun, 04 Feb 2018 10:27:05 -0800

FTR, here are some observations on optimizing new acmp semantics which will 
help motivate adopting the L-world design.

From: John Rose <john.r.r...@oracle.com>
Date: January 29, 2018 at 9:42:49 PM GMT+1
To: valhalla-dev <valhalla-...@openjdk.java.net>
Subject: optimizing acmp in L-World

Tobias, Roland, and I had a chat in Santa Clara last week
about optimizing the acmp operation in the JVM.  Here are
my notes on it.

Background:  The acmp operation, also known as reference
comparison, is overloaded in L-World to handle values as well,
with a specific semantics of returning false if either operand is
a value (similar to NaN behavior for fcmp).  The rationale for
this is that acmp is reference comparison, even in L-World,
and since values have no references, they can never be
equal as references.

(Alternatively, it is *as if* both operands were converted
*to references* before the acmp operation, which means
that either operand, if it is a value, *would be* boxed to a
reference value, which then, being a new reference, is
automatically unequal to any other reference.  This tricky
account may be appealing in some circumstances.)

OK, so how do we make it fast?  Ideally, we would like
to strength-reduce the enhanced acmp instruction down
to the original acmp instruction, which after all is just a
single instruction (such as cmpq) on all platforms.

The first thing to notice, before going into details, is
that in L-world all Q-values are buffered in the interpreter.
This means that, at least for interpreter-generated values,
all values can be treated as physical pointers, even
if they are logically Q-values.

(Crib sheet:  In L-world, all kinds of values except primitives
are formally carried by L-values, under L-descriptors.  True
references can be referred to as R-values of R-type, reference
type, or object class.  New value types are referred to as Q-values
of Q-type, value type, or value class.  When you don't know what
you have, you can say it's a U-type, which just like a Q-type or
R-type is carried by an L-value under an L-descriptor.)

This means that the old acmp semantics are almost correct,
in many cases.  If you have two L-values and do a pointer
comparison on their physical pointers, and if the pointers
differ, you are done.  If the physical pointers compare equal,
then there is one more thing you have to do:  Make sure that
the L-value carried by both physical pointers is in fact a
reference, an R-value.

So the logic looks like this, and the new semantics are in
the second "if" statement:

 bool new_acmp(ptr x, ptr y) {
   if (!old_acmp(x, y))  return false;  // x != y => false
   if (is_value_ptr(x))  return false;  // x.is_value => false
   return true;
 }
 bool old_acmp(ptr x, ptr y) {
   return x == y;  // cmpq etc.
 }

Note that since acmp is symmetric, the JVM has the option
to silently test either x or y for "is_value_ptr" if the two physical
pointers compare equal.

This leads immediately to the first set of optimizations, which
probably are enough to get us where we want to go.  (There
is a second set we can add later.)  All the JVM has to do is
prove or detect that one or the other physical pointer either
*is certainly* a Q-value, or *is certainly not* a Q-value.

If either physical pointer is certainly a Q-value, then the
new_acmp instruction is a constant false.  If either physical
pointer is certainly not a Q-value, then the new_acmp
instruction strength reduces to the old_acmp instruction.

The strength reduction to false can be done in any place
where the type of either operand is known to be a Q-type.
The interpreter doesn't track types, and the acmp instruction
isn't resolved so it can't be quickened to a type context,
but the JIT has lots of type information.  Thus, if either
operand of acmp is statically known (or accurately speculated)
to be a value type, then the acmp instruction constant folds
to false.

Likewise, if either operand is known to be an R-type
(true reference) the acmp may be strength reduced to
a simple reference comparison.  This also is likely to
happen in the JIT.  It also happens in the interpreter
in the acmp_null instruction, where (obviously) the
second operand is a reference (i.e., null).

(Note that the type Object is a U-type in L-world, so
Object-to-Object comparisons such as are found in
generic collection code do not benefit from this
optimization.  More later on that.)  

The JIT can also track, in its profile, whether various
instructions "see" only Q-values or only R-values.
(The classic HotSpot already tracks "sightings" of
null for similar purposes.)  In that case, speculative
narrowings (only Q-types or only R-types) can be
used to short circuit new_acmp down to old_acmp.

This is useful because the speculation can sometimes
be verified out of a loop body, where the actual acmp
instruction is executed in a loop.  Remember that the
JIT can choose which operand of acmp to test for
value-ness; if it chooses a loop invariant, and the
invariant is always Q or always R, then the acmp
inside the loop folds up, no matter what the dynamic
value of the other operand.

For that matter, even if speculation fails, loop
unswitching can get to a similar result, at the cost
of two copies of the loop.

So, in many cases, contextual information can allow
the JVM to fold up new_acmp to old_acmp or constant
false.

In cases such folding fails, there is as simple trick
which can make things easier.  Recall that folding
fails when *both* operands are statically U-types,
*neither* operand folding to a Q-type or R-type.

In that case, the JVM must pick one operand or
the other (it doesn't matter which) and perform
the is_value_ptr operation on it, in such a way
that the result of that operand affects the final
result in the correct way.  A naive implementation
of new_acmp adds an extra test-and-branch
to fold in the is_value_ptr test, but a clever might
be able to avoid it.

Suppose there is a branch free technique for
implementing is_value_ptr, such as extracting
a bit from the physical pointer to the U-value,
or (after a null check) extracting a bit from the
physical pointer of the class metadata pointer
in the value's header, or (after two indirections)
extracting a bit from the structure of the class
metadata.  (Mr. Simm's prototype puts such
a bit in the metadata pointer, in the object
header.)

If such a bit can be obtained in a branch-free
manner, then it can be used to perturb the result
of the old_acmp in a useful manner.  For example:

 bool new_acmp(ptr x, ptr y) {
   if (x == null)  return y == null;
   int bit = (x->class_in_header & Q_TYPE_BIT);
   ptr x1 = x | swar_broadcast_bit(bit != 0);
   return old_acmp(x1, y);
 }
 int swar_broadcast_bit(bool z) {
   return z ? -1 : 0;  // use ALU ops
 }

The idea is to extract a bit which signals the presence
of a Q-type (in either operand) and use it to "mess up"
the equality comparison, using and, or, or xor as needed.

This perturbation technique has two costs:  First,
it requires a null check (unless the Q-bit is in the
actual physical reference).  Second, it requires
an ALU operation or two to spread the perturbation.
But these costs will probably be negligible, except
(perhaps) in very tight loops in generic code.

Generic code loops performing frequent acmp operations
on untyped (Object) operands will need to perform
extra null checks and/or value/reference detections
if they are not already present (by luck) in the loop
context.  In this case, loop unswitching may be useful.

There is one more contextual operation which may be
helpful in such cases, if all else fails (and loop unswitching
is not desirable).  Honestly, I haven't seen any real-world
code yet that needs it, but it is comforting to have as a
fall-back.  The idea is this:  If the acmp is contextually
followed by a call to Object.equals, in the usual Legacy
Idiom For Equality (L.I.F.E.), then there is one more trick
we can play.  We'd like to force the new_acmp down to
an old_acmp in these cases.  Can we?  Here's the choice:

 bool dutiful_life(ptr x, ptr y) {
   if (new_acmp(x, y))  return true;
   return x->Object_equals(y);
 }
 bool clever_life(ptr x, ptr y) {
   if (old_acmp(x, y))  return true;
   return x->Object_equals(y);
 }

Note that the only semantic difference between old_acmp
and new_acmp arises when *both* operands are the *same*
value.  (Work out the cases:  In every other case, the simple
pointer comparison detects the difference and produces the
correct "false" answer.)  Thus, in the L.I.F.E., the only case
where old_acmp gives the wrong answer produces an
immediate answer of "true", instead of following up with
a call to x.equals(x) on the Q-value.

To put it in a nutshell, the only difference between dutiful_life
and clever_life is that the former always calls Object::equals
if either operand is a value, and the the latter calls
Object::equals in a subset of those cases (when the
physical pointers differ).  When the physical pointers
do *not* differ, there is no call to Object::equals.  But
clever_life always delivers the same answer as dutiful_life.

So is there a problem?  Not much, although it requires us to treat
Object::equals as an intrinsic with some predictable semantics.
The crucial inside that allows us to adopt clever_life is that,
if you call x.equals(x) on a non-null x, the result must always
be true.  Now, this is not enforced by the JVM, but it is clearly
documented in the API specification for Object::equals, and
lots of stuff would be already broken if that were ever false.

Thus, if we are willing to rule out implementations of Object::equals
for which a value can ever compare *not equal* to itself, then
we can substitute clever_life for dutiful_life.  We also have to
allow one more thing:  That we can silently drop calls to equals
(at least in the context of L.I.F.E.) when the JVM can prove that
the contract of equals allows the JVM to predict the outcome.
This means that if you put side effects (like print statements)
into an equals method on a value type, you might not see the
side effects, after the JVM has optimized things.  This corner
case may require further thought in order to straighten it out.

Note that L.I.F.E. is frequently found in generic code, where
instead of operating on Object values it operates on type
parameter types.  (The JVM just sees Object, after erasure.)
In that case, if the JVM were to specialize the generic code
separately for some types (such as value types), along the
lines of the template class proposals, then the L.I.F.E.
code would fold up in the JIT using the rules given at the
top of this note.  The generic expression "a == b", where
a or b was a value of a specialized type parameter T,
would fold up to false in every specialization where T was
a Q-type.  Likewise, as noted above, if L.I.F.E. is applied
to a true R-type (or specialized to such a type), then any
new_acmp on that type can be short-circuited to the faster
old_acmp.

All in all, it appears that the cost of the new acmp semantics
can, for all practical purposes, be pushed down into the noise.

— John

Fwd: optimizing acmp in L-World

Reply via email to