Re boxing: Good idea to minimize it.

Re storing type: this is a time/space tradeoff. If you store the union type in each record then you can write instances faster, since you can use type.ordinal(), and also dispatch switch statements for the value faster in user code. A hash table lookup is probably not a lot faster than an if/then/elseif/... unless the union is really big. Personally I'd opt for performance over slightly smaller instances here. I suppose we could offer both options, since it shouldn't alter the public API.

Also, we don't need to store a class in the enum. At code generation time we can simply write, e.g., Integer.class wherever you put INT.type.

Doug

On 06/10/2010 02:09 PM, Scott Carey wrote:

On Jun 10, 2010, at 11:49 AM, Doug Cutting wrote:

On 06/10/2010 09:27 AM, Scott Carey wrote:
I propose:
* We use getters/setters, and allow setters and constructors to be
package protected with an option so the objects can be safely passed
around outside the package without stateful wrapper objects.  Users
can write factory methods or other static helper methods to abstract
away as much as they would like from there.
* Setters should not allow invalid data to be set.

I'd be okay with these changes.  +1

* Provide some built-in mechanism to resolve unions on a read and
verify them on a write.  Perhaps a inner static enum for each union,
and a getter for each branch.

So a schema like:

{"type":"record", "name":"Foo", "fields":[
   {"name":"bar", "type":["int", "float"]}]}

might generate something like:

public class Foo {
   public enum BarType { INT, FLOAT }
   private BarType bar$Type;
   private Object bar;
   public void setBar(int value) { bar = value; bar$Type = INT; }
   public void setBar(float value) { bar = value; bar$Type = FLOAT; }
   public int getBar$Int() {
     if (bar$Type != INT) throw new ...;
     return (int)bar;
   }
   public float getBar$Float() {
     if (bar$Type != FLOAT) throw new ...;
     return (float)bar;
   }
}


I was trying to think of a way to storing the type in there to keep the memory 
footprint minimized.
The getter name disambiguation needs a good format, and the $ delimiter above 
would work well and be clear to users.
However, I was thinking something more along these lines:

{code}
public class Foo {
   // intrinsic simple types could be shared rather than in each class
   // named types must be per union.
   public enum BarType {
     INT(Integer.class),
     FLOAT(Float.class);
     private Class<?>  type; // internal use only
     BarType(Class<?>  type) {
       this.type = type;
     }
   }
   private Object bar;
   // Integer vs int: avoid auto-unbox, auto-rebox if the user has an Integer 
already
   public void setBar(Integer value) { bar = value; }
   // same, Float vs float
   public void setBar(Float value) { bar = value; }
   public Integer getBar$Int() {
     // we could return null instead of throwing.  Return Integer, client can 
un-box if needed
     if (bar.getClass() == INT.type) return bar; else throw new ...;
   }
   public Float getBar$Float() {
     if (bar.getClass() == FLOAT.type) return bar; else throw new ...;
   }
   public BarType getBar$Type() {
     // we could use a static ConcurrentHashMap<Class<?>, BarType>   instead of 
cascaded if/else if/ ...
     Class<?>  c = bar.getClass();
     if (INT.type == c) {
        return INT;
     } else if (FLOAT.type == c) {
         return FLOAT;
     } else {
       throw new NeverHappensException();  // :)
     }
}
{code}

I think the above is thread-safe with no race conditions, though if one thread 
is calling the setter toggling the type another thread calling getBar$Type 
might see a 'delayed' type.

Since the ENUM is exposed to the client, it also needs a regular naming pattern 
so that schema evolution doesn't change the method names.

A user accessing an enum can then switch on the enum and call the appropriate 
method.


Is that the sort of thing you have in mind?  I've added dollar signs to
avoid potential conflicts, e.g., if the user had a field named 'barType'
or 'barInt'.  I think its best to always add the dollars, since
otherwise it could get awkward if a field named 'barInt' were added later.

* Unions with one type and NULL should not translate to Object, but
rather be that type and allow null.

That's already the case, no?

Also, you didn't provide a proposal for constructors, yet, right?

Yes, I'll need a more complicated example to discuss that one appropriately.  A 
pattern that aids in business object<>  Avro translation, especially in the 
presence of class hierarchies and a couple different object composition patterns. A 
Union as a representation of a subclass is tricky.  So is it when modeling 
composition.

Perhaps all of the setters could be on a generated builder class, with
an instance only returned when it's completely populated?

Providing builder classes rather than constructors could be extremely useful to 
aid in encapsulation.  It should address my constructor concern and remain 
flexible.



Doug

Reply via email to