Let's try and separate the various things going on, and then we can see if there are attractive fictions we want to spin.  First, let's talk about kinds of polymorphism.  Cardelli and Wegner's "On understanding types, data abstraction, and polymorphism" (1985) divides polymorphism into a hierarchy (though these distinctions predate this paper):

Polymorphism
    Universal
        Parametric
        Inclusion
    Ad Hoc
        Overloading
        Coercion

Inclusion polymorphism is subtyping; the value set of String is included in the value set of Object. Coercion polymorphism is conversions; we can use a `short` where an `int` is called for, because we can summon an `int` with the same value as the `short` at will. Overloading refers to the fact that we can declare `f(int)` and `f(long)` so that at the use site, `f` appears to take multiple types. (Pattern matching fits into ad-hoc polymorphism, but in Java it gets funneled through the other forms first.  Union types are another form of ad-hoc polymorphism.)
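All four forms can be seen in today's Java in a few lines (class and method names here are just illustrative):

```java
class Poly {
    // Overloading (ad hoc): the same name `f` appears to take multiple types.
    static String f(int x)  { return "int"; }
    static String f(long x) { return "long"; }

    // Parametric (universal): one definition ranging over all reference types T.
    static <T> T identity(T t) { return t; }

    public static void main(String[] args) {
        // Inclusion (subtyping): String's value set is included in Object's.
        Object o = "hello";

        // Coercion: a short is usable where an int is called for, because we
        // can summon an int with the same value via a widening conversion.
        short s = 7;
        int i = s;

        System.out.println(f(i));            // resolves to f(int)
        System.out.println(f(7L));           // resolves to f(long)
        System.out.println(identity("x"));   // T inferred as String
    }
}
```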

The special behavior of `null` could be explained by multiple paths:
 - Subtyping, with the Null type as a bottom type for all reference types, or
 - Ad-hoc, where it is understood that `null` is in the value set of every reference type, and treating an unbounded T as an (infinite) union type.

I think the latter path for explaining null is more useful in general, and it is probably closer to what the JLS actually says - this is how interfaces "inherit" members like Object::equals.  (I think it also offers a more useful way to talk about non-nullable reference types, but I'll come back to that.)
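Both observations are easy to demonstrate in today's Java (`Shape` is a made-up interface for illustration):

```java
import java.util.List;

interface Shape {}   // declares no members of its own

class NullAndUnions {
    // For an unbounded T, null is in the value set of every type T could
    // name, so returning null is always permitted.
    static <T> T firstOrNull(List<T> list) {
        return list.isEmpty() ? null : list.get(0);
    }

    public static void main(String[] args) {
        Shape s = null;                    // null is in Shape's value set
        Shape t = new Shape() {};
        // Shape "inherits" Object::equals without declaring it:
        System.out.println(t.equals(t));                    // true
        System.out.println(firstOrNull(List.of()) == null); // true
    }
}
```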

Java exhibits all of these forms of polymorphism.  Parametric and inclusion are on prominent display, but the others are there too, and coercion is actually quite relevant, both in general and to the point you are making, which is about how the user thinks about what it means to instantiate a generic class with `int`.  I think we will need all of these tools to get to "everything is an Object" (which I think we agree is a desirable unification.)

A `String` is an Object through inclusion.  An `int` is an Object through coercion; if we have an `int` and we need an `Object`, we can box the `int` to `Integer`.  Today we do this only for assignment, but going forward we will do this in other contexts, such as member access (e.g., `1.toString()`), equality, array covariance, and serialization.  We heal the multiple rifts through a combination of subtyping and coercion.
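Here is the coercion path as it stands today (note that `1.toString()` is a possible future, not current Java; today you box by hand):

```java
class Box {
    static String describe(Object o) { return o.getClass().getSimpleName(); }

    public static void main(String[] args) {
        Object boxed = 42;   // boxing conversion: int -> Integer, then widening to Object
        System.out.println(describe(boxed));                 // "Integer"
        // `1.toString()` does not compile today; the boxing must be explicit:
        System.out.println(Integer.valueOf(1).toString());   // "1"
    }
}
```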

So, in the world of universal type variables, what is a T?  I claim it is a union over the set of all types that conform to T's bound. Today this includes only reference types, but once we extend bounds conformance to admit types that are convertible to T's bound, this includes value types as well.

This union offers a rational explanation for why we can say `t.toString()` -- because `toString()` is a member of every type in the union (when we enumerate the members of a union type, we take the _intersection_ of the members of all the types in the union).  We leave it to the implementation as to how to actually dispatch `toString()`, which will be different depending on whether we specialize `Foo<T>` or not.  It also offers a rational explanation of why `T` has `null` in its value set today -- and why we are going to adjust this to generate unchecked warnings tomorrow -- because now we'll be intersecting in some types that don't have null.  The same is true for `synchronized` -- which has nothing to do with reference vs value, but with identity -- and again, we're now adding new types to the union that don't have a previously universal property.

The union model is based on the "stand in" model -- T can stand for some unknown type, so you can at most do things on a T that you can do on *all* the unknown types.  (Even when we get to specialized generics, we might still not allow all such operations, such as `new T[n]`; the union offers an upper bound on what is sensible, but languages can be more restrictive.)
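The stand-in model cashes out concretely today: a method on an unbounded T may only use members found in the intersection of all possible instantiations (which is Object's member set), and some operations are off the table regardless:

```java
class StandIn {
    static <T> String render(T t) {
        // toString() is in the member-set intersection of every possible T,
        // so this is sensible for any instantiation.
        return t.toString();
        // Something like `new T[10]` is rejected: the union gives an upper
        // bound on what is sensible, and the language is more restrictive still.
    }

    public static void main(String[] args) {
        System.out.println(render(42));     // works for (boxed) values...
        System.out.println(render("hi"));   // ...and for references
    }
}
```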

The best way I've found to think about types like `String!` in Java is as _refinement types_.  (See Liquid Haskell (https://ucsd-progsys.github.io/liquidhaskell-tutorial/), or Clojure Spec (https://clojure.org/guides/spec)).  A refinement type takes a type and a _predicate_ which refines its value set, such as "even integer", and can contain arbitrary predicative logic.  The compiler then attempts to prove the desired properties (easier in functional languages).  In other words, the type `String!` takes as its base the reference type `String`, along with a predicate `s -> s != null`.  Taking away the null doesn't change the reference-ness of it, it just restricts the value set.
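To make the idea concrete, here is a hand-rolled sketch -- `Refined` is a hypothetical helper, not any existing API. A real refinement type is discharged statically by the compiler; since Java can't do that, this sketch checks the predicate at construction time instead:

```java
import java.util.function.Predicate;

// Hypothetical: pairs a base type T with a predicate that refines its value set.
final class Refined<T> {
    final T value;
    private Refined(T value) { this.value = value; }

    // The predicate is checked here, at runtime, since the compiler
    // cannot prove it for us the way a real refinement type system would.
    static <T> Refined<T> of(T value, Predicate<T> predicate) {
        if (!predicate.test(value))
            throw new IllegalArgumentException("refinement violated");
        return new Refined<>(value);
    }

    public static void main(String[] args) {
        // "String!" as base type String plus the predicate s -> s != null:
        Refined<String> nonNull = Refined.of("hello", s -> s != null);
        System.out.println(nonNull.value);
    }
}
```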

Interestingly, the languages that have the most direct claim to modifiers like `!` and `?` treat them as _cardinalities_, such as X# and to a lesser degree XSL.  In X#, where "everything is a sequence", cardinality modifiers are: refinement types!  They constrain the length of the sequence (imagine a refinement type on List which said "size() > 3".)

We're clearly never going to plunk for arbitrary predicative logic in our type system and the theorem provers that come with it, but ad-hoc predicates like "not null", "has identity" and "is reference" are already swimming under the surface of the type system we have, and we'll see more like this when we get to specialization (where we will model specialized instantiations as refinements rather than substitution.)


OK, with that as background, let's dive into your mail.

I'm sure the theoretic argument is fine as far as it goes, but it's not much help for the end user. My issue is with the user model we present to the world; what "useful fictions" are we securing for them, that enable them to read and write code with confidence?

One locus of potential fiction is what we mean by "is" in "everything is an Object".  If a T is an Object, do we really just mean "things that are subtypes of Object", or do we mean "things that can be bounded by Object" (which includes value types via conversion/coercion, rather than via subtyping)?  I think ultimately the latter is more helpful, because when someone says `ArrayList<long>`, what they really want is an ArrayList that is backed by a long[], with all the non-nullability, flatness, and tearability that long already has.  `ArrayList<T>` can be thought of as something that "has Ts" in it; if we are substituting in T=long, we will want all the properties of long because that allows for greater compositionality of semantics.

*Some "T always a reference type" advantages:*

* With subtype polymorphism, the user enjoys a solid understanding that "reference types are polymorphic, value types are monomorphic". As I'd put it: you can never have a value (say as a field) without statically knowing its exact type, because its exact type governs the shape and interpretation of the bits actually making up the value. Don't know the exact type --> you need a reference. But parametric polymorphism (thanks for laying out these terms in the JEP draft, Dan) feels very similar! I'd expect the user to consult the same intuitions we just drilled into them about subtype polymorphism. It would be nice if the same simple rule held there too.

I think this tries to flip around "reference types are polymorphic" into "polymorphic types are references."   T is polymorphic, users will get that without trouble.  But does it have to be inclusion polymorphism?  I think it is an ad-hoc union between coercion (value types) and inclusion (reference types).

If we push towards the fiction of "they're all reference types", then Foo<long> really means Foo<Long>, with all the nullability and tearability differences between long and Long.
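The nullability half of that gap is already visible with today's boxes (`Holder` is a made-up class for illustration): a `Long`-typed slot silently admits null where a `long` never could.

```java
class Holder<T> {
    T value;   // defaults to null when T is instantiated with a reference type
}

class Gap {
    public static void main(String[] args) {
        // Under the "Foo<long> really means Foo<Long>" fiction, the field
        // picks up Long's nullability, which long itself does not have:
        Holder<Long> h = new Holder<>();
        System.out.println(h.value);   // null

        long primitive = 0L;           // a long defaults to 0L and can never hold null
        System.out.println(primitive);
    }
}
```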

* When my class gets used as `MyClass<int>`, I would get to reason like so:
    * When that code runs on some JVM that doesn't do specialization yet, then my class gets used directly, so those `int`s are really `Integer`s; of course they are, because T is a reference type. (I expect I can't tear a value this way.)

I would say it differently: in this world, `long` *erases to* `Object`, just as `String` does.  Which means `String` picks up some properties of Object that it doesn't have on its own, such as the chance for heap pollution.  Similarly, when we erase `long` to `Object`, we pick up some of these properties too, including the additional chance of null pollution, as well as some atomicity we didn't ask for.  But that's because of the erasure, not for any intrinsic property of type variables.  And the compiler will try to claw back some of that nullability with unchecked warnings anyway, just as we try to claw back some of the vectors for heap pollution. The nullity of T is the same erasure-driven pollution we already know and tolerate.

    * When that code runs on some JVM that has specialization, then different "species" of my class are being forked off from my template, each one physically /replacing/ T with some value type. So /those/ are value types, but once again T is still a reference type. (And here I do expect tearing risk, for non-atomic types.)

When I specialize `Foo<long>`, any T-valued fields or arrays or method parameters really are long, with all the characteristics of long.  Treating them as references (which have properties long doesn't have) seems more confusing.  "Placeholder, which collapses to its instantiation" feels more natural here?

* If Java were ever to have non-nullable reference types, I suspect it might immediately expose this whole type variable issue as having been, at its essence, never really about ref-vs-val in the first place. What it's really about is that there used to be one value in the union of every Object type's value set, and now there isn't anymore.

Agree -- it was always about the union of types / intersection of properties of those types.  Null used to be in that intersection, but now things got more complicated -- but doesn't this argue against the reference interpretation, and towards the placeholder/union interpretation?

* The best way a user can prepare their generic class for becoming "universal" in the future is to adopt aftermarket nullness analysis (such as I'm working on standardizing the semantics for in JSpecify). They'll mark type parameters like `V extends @Nullable Object`, and methods like `Map.get` will return `@Nullable V`. That will shake out any obstacles up front. Then once V becomes a UTP, they'd just change that `V` to `V.ref`, and they could presumably drop the `@Nullable` too because `.ref` implies it (why else would it be used?). So the language feature you're introducing for ref-vs-val universality is immediately doing double duty, capturing nullness information for reference types too.

This would probably mean rethinking the `T.ref` syntax to something that more closely evokes "T or null" (the fact that this would, for an <int> species, have to box to `Integer` in the process seems intuitive enough).

Open to finding a better way to spell "T or null"; I think the path to this involves having this conversation converge :)
