We agree that the potential source incompatibility is an acceptable price for the reduced bytecode complexity in Model 5. If the source incompatibility turns out to be more severe than expected, would it make more sense to bring back separate wildcards (?/ref, any) than to bring back the bytecode complexity of Model 4?
 
--
Bjørn Vårdal
 
----- Original message -----
From: Brian Goetz <[email protected]>
Sent by: "valhalla-spec-experts" <[email protected]>
To: [email protected]
Cc:
Subject: Wildcards -- Models 4 and 5
Date: Fri, May 20, 2016 2:36 PM
 

In the 4/20 mail “Wildcards and raw types: story so far”, we outlined our explorations for fitting wildcard types into the first several prototypes. The summary was:

  • Model 1: no wildcards at all

  • Model 2: A pale implementation of wildcards, with lots of problems that stem from trying to fake wildcards via interfaces

  • Model 3: basically the same as Model 2, except members are accessed via indy (which mitigated some of the problems but not all)

The conclusion was: compiler-driven translation tricks are not going to cut it (as we suspected all along). We’ve since explored two other models (call them 4 and 5) that cover a range of options for VM support for wildcards. What follows is a preliminary analysis of these options.

Reflection, classes, and runtime types

While it may not be immediately obvious that this subject is deeply connected to reflection, consider a typical implementation of equals():

      class Box<T> {
          T t;

          public boolean equals(Object o) {
              if (!(o instanceof Box))      // raw (or wildcard) Box as the test target
                  return false;
              Box other = (Box) o;
              return (t == null) ? other.t == null : t.equals(other.t);
          }
      }

Some implementations use raw types (Box) for the instanceof and cast target; others use wildcards (Box<?>). While the latter is recommended, both are in wide circulation. In any case, as observed in the last mail, were we to interpret Box or Box<?> as including only erased boxes, then this code would silently break.
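
To make the failure mode concrete, here’s an illustrative sketch (using the hypothetical specialized syntax from these mails; it doesn’t compile today):

      // Illustrative only: assumes a specialized instantiation Box<int>.
      Box<int> a = new Box<int>();
      Box<int> b = new Box<int>();
      a.t = 42; b.t = 42;
      // If instanceof Box matched only erased boxes, the test inside equals()
      // would fail for the Box<int> argument, and a.equals(b) would silently
      // return false instead of true.
      boolean broken = a.equals(b);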

The term “class” is horribly overloaded, used to describe the source class (class Foo { ... }), the binary classfile, the runtime type derived from the classfile, and the reflective mirror for that runtime type. In the past these existed in 1:1 correspondence, but no more — a single source class now gives rise to a number of runtime types. Having poor terminology causes confusion, so let’s refine these terms:

  • class refers to a source-level class declaration
  • classfile refers to the binary classfile
  • template refers to the runtime representation of a classfile
  • runtime type refers to a primitive, value, class, or interface type managed by the VM

So historically, all objects had a class, which equally described the source class, the classfile, and the runtime type. Going forward, the class and the runtime type of an object are distinct concepts. So an ArrayList<int> has a class of ArrayList, but a runtime type of ArrayList<int>. Our code name for runtime type is crass (obviously a better name is needed, but we’ll paint that bikeshed later.)

This allows us to untangle a question that’s been bugging us: what should Object.getClass() return on an ArrayList<int>? If we return ArrayList, then we can’t distinguish between an erased and a specialized object (bad); if we return ArrayList<int>, then existing code that depends on (x.getClass() == ArrayList.class) may break (bad).

The answer, of course, is that there are two questions the user can ask an object: what is your class, and what is your crass, and these need to be disentangled. The existing method getClass() will continue to return the class mirror; a new method (getCrass()) will return some form of mirror for the runtime type. Similarly, a class literal will evaluate to a class, and some other form of literal / reflective lookup will be needed for crasses.
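
To illustrate the intended split (getCrass() is just the working name above, and ArrayList<int> is hypothetical specialized syntax; none of this is committed API):

      // Illustrative only: ArrayList<int> and getCrass() are placeholders.
      ArrayList<int> list = new ArrayList<int>();
      list.getClass();   // class mirror: ArrayList.class, shared by every parameterization
      list.getCrass();   // some runtime-type mirror for ArrayList<int> (exact form TBD)
      // Existing checks such as (x.getClass() == ArrayList.class) keep working.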

The reflective features built into the language (instanceof, casting, class literals, getClass()) are mostly tilted towards classes, not types. (Some exceptions: you can use a wildcard type in an instanceof, and you can do unchecked static casts to generic types, which are erased.) We need to extend these to deal in both classes and crasses. For getClass() and literals, there’s an obvious path: have two forms. For casting, we are mostly there (except for the treatment of raw types for any-generic classes, which we need to work out separately.) For instanceof, it seems a forced move that instanceof Foo is interpreted as “an instance of any runtime type projected from class Foo”, but we would want to apply it to any reifiable type as well.
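
For concreteness, both of those exceptions are already expressible in today’s source language (a quick illustration):

      import java.util.List;

      class ReflectionToday {
          static void demo(Object o) {
              // Wildcard type as an instanceof target: allowed, since List<?> is reifiable.
              boolean isList = o instanceof List<?>;
              // Unchecked static cast to a generic type: allowed, but erased at runtime.
              @SuppressWarnings("unchecked")
              List<String> strings = (List<String>) o;
          }
      }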

Wildcard types

In Model 3, we express a parameterized type with a ParamType constant, which names a template class and a set of type parameters, which include both valid runtime types as well as the special type parameter token erased. One natural way to express a wildcard type is to introduce a new special type parameter token, wild, so we’d translate Foo<any> as ParamType[Foo,wild].

In order for wildcard types to work seamlessly, the minimum functionality we’d need from the VM is to manage subtyping (which is used by the VM for instanceof, checkcast, verification, array store checks, and array covariance.) The wildcard must be seen to be a “top” type for all parameterizations:

ParamType[Foo,T] <: ParamType[Foo,wild]  // for all valid T

And, wildcard parameterizations must be seen to be subtypes of their wildcard-parameterized supertypes. If we have

       class Foo<any T> extends Bar<T> implements I<T>       { ... }
       class Moo<any T> extends Goo { }

then we expect

ParamType[Foo,wild] <: ParamType[Bar,wild]
ParamType[Foo,wild] <: ParamType[I,wild]
ParamType[Moo,wild] <: Goo

Wildcards must also support method invocation and field access for the members that are in the intersection of the members of all parameterizations (these are the total members (those not restricted to particular instantiations) whose member descriptors do not contain any type variables). We can continue to implement member access via invokedynamic (as we do in Model 3), or alternatively the VM can support invoke* bytecodes on wildcard receivers.
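
As an illustration of which members a wildcard receiver exposes (hypothetical any-generic syntax; the class and member names are made up):

      // Illustrative only: a hypothetical any-generic class.
      class Holder<any T> {
          int count;                      // descriptor mentions no type variable
          T value;                        // descriptor mentions T
          void clear() { count = 0; }     // no type variable in descriptor
          T get() { return value; }       // T in descriptor
      }
      // Through a Holder<?> receiver (ParamType[Holder,wild]), count and clear()
      // are accessible; value and get(), whose descriptors mention T, are not in
      // the wildcard's member set.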

We can apply these wildcard behaviors to any of the wildcard models (i.e., retrofit them onto Model 2/3.)

Partial wildcards

With multiple type variables, the rules for wildcards generalize cleanly, but the number of wildcard types that are a supertype of any given parameterized type grows exponentially in the number of type variables. We are considering adopting the simplification of erasing all partial wildcards in the source type system to a total wildcard in the runtime type system (the costs of this are: some additional boxing on access paths where boxing might not be necessary, and unchecked casts when casting a broader wildcard to a narrower one.)
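
For example (illustrative, assuming a two-type-variable any-generic Map):

      // Under the proposed simplification, partial wildcards erase to the total
      // wildcard in the runtime type system:
      //   source type          runtime type
      //   Map<String, ?>   ->  ParamType[Map, wild, wild]
      //   Map<?, ?>        ->  ParamType[Map, wild, wild]
      // Casting a Map<?, ?> back to a Map<String, ?> then needs an unchecked cast,
      // and access through Map<String, ?> may box where a more precise runtime
      // type would not have needed to.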

Model 4

A constraint we are under is: existing binaries translate the types Foo (raw type), Foo<String> (erased parameterization), and Foo<?> all as LFoo; (or its equivalent, CONSTANT_Class[Foo]); since existing code treats this as meaning an erased class, the natural path would be to continue to interpret LFoo; as an erased class.

Model 4 asks the question: “can we reinterpret legacy LFoo; in classfiles, and Foo<?> in source files, as any Foo?” (restoring the interpretation of Foo<?> to be more in line with user intuition.)

Not surprisingly, the cost of reinterpreting the binaries is extensive. Many bytecodes would have to be reinterpreted, including new, {get,put}field, invoke*, to make up the difference between the legacy meaning of these constructs and the desired new meaning. Worse, while boxing provides us a means to have a common representation of signatures involving T (T’s bound), in order to get to a common representation for signatures involving T[], we’d need to either (a) make int[] a subtype of Object[] or (b) have a “boxing conversion” from int[] to Object[] (which would be a proxy box; the data would still live in the original int[].) Both are intrusive into the aaload and aastore bytecodes and still are not anomaly-free.

So, overall, while this seems possible, the implementation cost is very high, all of it paid for the sake of migration; and the added VM complexity would remain as a legacy constraint long after the old code has been migrated.

Model 5

Model 5 asks the simpler question: can we continue to interpret LFoo; as erased in legacy classfiles, but upgrade the treatment of Foo<?> in source code to what users expect? This entails changing the translation of Foo<?> from “erased Foo” to ParamType[Foo,wild].
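
As a sketch of the translation change (constants shown informally):

      // Illustrative: how a source-level cast to Foo<?> might translate.
      //   legacy / erased translation:   checkcast CONSTANT_Class[Foo]     // "erased Foo"
      //   Model 5 translation:           checkcast ParamType[Foo,wild]     // "any Foo"
      // Legacy classfiles keep their CONSTANT_Class[Foo] constants, which the
      // VM continues to interpret as erased Foo.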

This is far less intrusive into the bytecode behavior — legacy code would continue to mean what it did at compile time. It does require some migration support for handling the fact that field and method descriptors have changed (but this is a problem we’re already working on for managing the migration of reference classes to value classes.) There are also some possible source incompatibilities in the face of separate compilation (to be quantified separately).

Model 5 allows users to keep their Foo<?> and have it mean what they think it should mean. So we don’t need to introduce a confusing Foo<any> wildcard, but we will need a way of saying “erased Foo”, which might be Foo<? extends Object> or might be something more compact like Foo<erased>.

Comparison

Comparing the three models for wildcards (2, 4, 5):

  • Model 2 defines the source construct Foo<?> to permanently mean Foo<erased ref>, even when Foo is anyfied, and introduces a new wildcard Foo<any> — but maintains source and binary compatibility.
  • Model 4 lets us keep Foo<?>, and retroactively redefines bytecode behavior, so an old binary can still interoperate with a reified generic instance, and will think a Foo<int> is really a Foo<Integer>.
  • Model 5 redefines the source meaning of Foo<?> to be what users expect, but because we don’t reinterpret old binaries, allows some source incompatibility during migration.

I think this pretty much explores the solution space. Our choices are: break the user model of what Foo<?> means, take a probably prohibitive hit to distort the VM to apply new semantics to old bytecode, or accept some limited source incompatibility under separate compilation but rescue the source form that users want.

In my opinion, the Model 5 direction offers the best balance of costs and benefits. While there is some short-term migration pain (limited to relatively few cases, and mitigable with compiler help), in the long run it gets us to the world we want without permanently burdening either the language (creating confusion between Foo<?> and Foo<any>) or the VM implementation.

In all these cases, we still haven’t defined the semantics of raw types. Raw types existed for migration between pre-generic and generic code; we still have that migration problem, plus the new migration problems of generic to any-generic, and of pre-generic to any-generic. So in any case, we’re going to need to define suitable semantics for raw types corresponding to any-generic classes.

 
