First, credit to @TobiHartmann for the diagnostic, and a lot of the solution.

# Diagnostic

According to the [Strict Field Initialization 
JEP](https://openjdk.org/jeps/8350458), while a strict field is being 
initialized it is not quite immutable, but it is observably immutable: at first, 
the field can only be written (during the early larval phase), then it can only 
be read (late larval or initialized phase), so the last write is the only value 
one can ever observe.
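
For illustration, here is a minimal sketch of those two phases, assuming a 
Valhalla value class and an explicit flexible constructor body (JEP 513); the 
class and field names are made up:

value class Counter {
    int n;                // implicitly strict and final in a value class
    Counter(int n) {
        this.n = n;       // early larval phase: the field may only be written
        super();          // explicit here for clarity; afterwards (late larval,
                          // then initialized) the field may only be read, so the
                          // write above is the only value anyone can ever observe
    }
}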

The interesting part is that, in the early larval phase, a field can still be 
subject to side effects. For a value object, that means it is not yet immutable 
until it reaches the unrestricted state. While not theoretically necessary, 
avoiding scalarization and keeping the value object behind a reference is a 
convenient way to make sure those side effects are correctly applied. This 
strategy means that we shouldn't scalarize before reaching the unrestricted 
state. Normally, in C2, figuring out what is early larval and what is not is the 
job of bytecode parsing, but in an OSR compilation, everything leading up to the 
StartOSR is not parsed, so some objects are soundly assumed to possibly be 
larval when they actually aren't. In the reported example, that leads to a 
drastic performance difference between the OSR and non-OSR compilations: the 
latter is able to eliminate allocations since it knows more precisely when the 
value object can be scalarized.

In the original example:

public value class MyNumber {
    private long d0;
    private MyNumber(long d0) { this.d0 = d0; }
    public MyNumber add(long v) { return new MyNumber(d0 + v); }

    private static void loop() {
        MyNumber dec = new MyNumber(123);
        for (int i = 0; i < 1_000_000_000; ++i) {
            dec = dec.add(i);
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; ++i) {
            loop();
        }
    }
}

OSR happens in the loop in `loop`, but there `dec` is not detected to be 
unrestricted (hence immutable, hence scalarizable), so the allocation in the 
inlined `add` still needs to happen because we need the buffer for the new 
`dec`. The first iteration traps at the exit of the loop (unstable if), OSR 
happens again, followed by a non-OSR compilation, which correctly finds that 
`dec` can be scalarized in the loop, making the third iteration fast.

# Solution

Overall, the solution is to improve our detection of early larval values. 
Since we keep parsing as-is, let's do that in `ciTypeFlow`. The logic is quite 
simple: a `ciInlineKlass` can be wrapped in a `ciWrapper` to mark it as known 
to be non-null. We extend `ciWrapper` so it can also tell when a value is early 
larval (which also implies non-null, since early larval means that `new` has 
happened).

Then, it's almost a state machine (see the sketch after these lists)! How 
"early larval" is introduced:
- the `new` bytecode
- being the receiver of a constructor

How it is removed:
- a call to a constructor

Other ways to obtain values are safe (I've experimented, and it seems the 
verifier was keeping me from doing something bad):
- returned values are never early larval
- parameters are never early larval
- receivers of anything but a constructor are never early larval

And other guarantees:
- calling a constructor twice is not legal
- merging paths with inconsistent states (early larval vs. unrestricted) is also 
disallowed
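
To make those lists concrete, here is an annotated Java walk-through (not the 
actual `ciTypeFlow` code; the class is made up) of where the states come from:

value class V {
    int x;
    V(int y) {
        // `this`, as the receiver of a constructor, is early larval on entry
        this.x = y;
        super();          // call to a constructor: `this` is now unrestricted
    }
    static V make(int y) {
        V v = new V(y);   // at the bytecode level this is `new` + `invokespecial <init>`:
                          // the reference is early larval only between the two
        return v;         // returned values are never early larval for the caller
    }
    V plus(int d) {       // the receiver of a regular (non-constructor) method
        return new V(x + d);   // is never early larval
    }
}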

Let's detail how the early larval state is removed from a `ciInlineKlass`. When 
invoking a method on a receiver, we know the receiver is not early larval 
afterwards: either it's a constructor (which initializes the object), or it's a 
regular method, and the verifier has already made sure the receiver was not 
early larval beforehand. The tricky part is removing the larval state from all 
the copies of the reference: when doing something such as

new SomeValueClass
dup
dup
astore 1
invokespecial <init>

we still have copies of the same reference in local 1 and on the stack, and all 
of them now refer to a non-larval value. To take that into account, one must 
scan all the cells (locals and stack) and remove the early larval wrapper from 
all the types sharing the same ident (see `void ciBaseObject::set_ident(uint id)` 
and `void ciObjectFactory::init_ident_of(ciBaseObject* obj)`). That is enough, 
since early larval values cannot be contained in another object: we can't make 
use of them before they reach the unrestricted state (see also [Flexible 
Constructor Bodies](https://openjdk.org/jeps/513)).
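
That containment guarantee is visible at the language level too: in the 
constructor prologue, `this` may not escape, so nothing else can ever hold an 
early larval reference. A small made-up example:

value class Boxed {
    long v;
    Boxed(long v) {
        this.v = v;
        // register(this);   // rejected by javac: `this` cannot escape before super()
        super();
        register(this);      // fine: the object is unrestricted from here on
    }
    static void register(Boxed b) { /* e.g. remember it somewhere */ }
}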

Another trick concerns the super constructor invocation inside a constructor: is 
that different from a constructor invocation outside of a constructor? Well 
yes, but actually no:
- outside a constructor, a constructor invocation leaves the object initialized
- inside a constructor, a constructor invocation doesn't make the object 
initialized, because the whole constructor is not done yet. But calling the base 
class' constructor ensures that `Object.<init>` is being called, climbing up 
the hierarchy, and once `Object`'s constructor has been reached, the object is 
late larval.

In both cases, after a constructor invocation the object is unrestricted (late 
larval or initialized) and, in the case of a value object, immutable, hence 
scalarizable, which is what matters here.

The case of an abstract value class needs a bit of a hack. An abstract value 
class is an `InstanceKlass` and a `ciInstanceKlass`, so we must take care not 
to apply the aforementioned logic only to `ciInlineKlass`, but also to 
`ciInstanceKlass` that are not identity classes.
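
For illustration, a made-up example of that situation: the constructor that 
clears the early larval state is declared on the abstract value class, which 
HotSpot represents as a (non-identity) instance klass rather than an inline 
klass:

abstract value class AbstractNumber {
    AbstractNumber() { }          // implicitly calls Object.<init>
}

value class ConcreteNumber extends AbstractNumber {
    long d0;
    ConcreteNumber(long d0) {
        this.d0 = d0;
        super();                  // invokespecial AbstractNumber.<init>: the invoked
                                  // constructor's holder is the abstract value class,
                                  // a `ciInstanceKlass` that is not an identity class
    }
}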

# Tests

Testing is not quite easy. First, there is a microbenchmark.

Running

make test \
    TEST="micro:org.openjdk.bench.valhalla.loops.osr.LarvalDetectionAboveOSR" \
    CONF_NAME=linux-x64

before the fix would give

Iteration   1: 1874.498 ms/op
Iteration   2: 1662.409 ms/op
Iteration   3: 204.573 ms/op
Iteration   4: 201.177 ms/op
Iteration   5: 201.526 ms/op
Iteration   6: 200.718 ms/op
Iteration   7: 198.127 ms/op
Iteration   8: 200.535 ms/op
Iteration   9: 201.079 ms/op
Iteration  10: 203.341 ms/op

and now

Iteration   1: 212.941 ms/op
Iteration   2: 225.214 ms/op
Iteration   3: 204.319 ms/op
Iteration   4: 199.884 ms/op
Iteration   5: 201.875 ms/op
Iteration   6: 202.331 ms/op
Iteration   7: 200.829 ms/op
Iteration   8: 200.794 ms/op
Iteration   9: 200.428 ms/op
Iteration  10: 203.278 ms/op

That's better! For this microbench, we need to do everything one usually 
shouldn't: no warmup, a single measurement... Indeed, as soon as we do more than 
OSR compilation, the effect disappears.

Otherwise, it's nice to see that the allocation is not there anymore in the OSR 
compilation. That could be an IR framework test, but alas, the IR framework 
doesn't support OSR so easily. The test therefore runs a subprocess and parses 
the output in a fairly basic way. It should be ported to an IR test once the IR 
framework supports OSR.
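
For the record, a rough skeleton of that kind of test, assuming the jtreg test 
library's `ProcessTools`/`OutputAnalyzer`; the VM flag and the output pattern 
below are placeholders, the real ones are in the PR:

import jdk.test.lib.process.OutputAnalyzer;
import jdk.test.lib.process.ProcessTools;

public class LarvalOsrAllocationTest {
    public static void main(String[] args) throws Exception {
        // Run the reproducer in a fresh VM with whatever flag makes C2 print
        // the information we want to inspect (placeholder flag below).
        ProcessBuilder pb = ProcessTools.createTestJavaProcessBuilder(
                "-XX:+SomeDiagnosticPrintFlag",   // placeholder, not a real flag
                "MyNumber");
        OutputAnalyzer out = new OutputAnalyzer(pb.start());
        out.shouldHaveExitValue(0);
        // Placeholder pattern: assert that no allocation shows up in the
        // OSR compilation's output.
        out.shouldNotContain("allocation-in-osr-marker");
    }
}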

Thanks,
Marc

-------------

Commit messages:
 - cleanup
 - abstract value class
 - Adjust microbench + parameters are never early larval
 - Less printing for testing
 - Trying
 - working....

Changes: https://git.openjdk.org/valhalla/pull/1531/files
  Webrev: https://webrevs.openjdk.org/?repo=valhalla&pr=1531&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8361352
  Stats: 424 lines in 12 files changed: 386 ins; 18 del; 20 mod
  Patch: https://git.openjdk.org/valhalla/pull/1531.diff
  Fetch: git fetch https://git.openjdk.org/valhalla.git pull/1531/head:pull/1531

PR: https://git.openjdk.org/valhalla/pull/1531
