On Dec 20, 2010, at 11:53 AM, Joe Crobak wrote:

> What's the "best" way to represent an optional enum in avro (in terms of 
> space efficiency, computational efficiency, and readability)?  To be 
> consistent with other optional fields, I was planning to use union of null 
> and my enum type.  The other approach I could see was adding a NULL field to 
> the enum -- but then my code would have to initialize the enum field to null 
> before a write.

The most space efficient would be the enum with a "NULL" option, but that is 
cumbersome as you mention for a few reasons.  If the field is often null, then 
the space efficiency difference would be minor.

enum:  always one byte (unless there are more than 64 options).
[null, enum]:  one byte when null, two otherwise.
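
To make the byte counts above concrete, here is a rough sketch that does not use Avro itself.  It assumes the zig-zag variable-length integer encoding that Avro's binary format uses for enum indices and union branch indices.  The class and method names (EncodedSize, varintLength, enumSize, optionalEnumSize) are made up for this illustration, and a negative index standing for null is only a convention of the sketch.

```java
// Sketch only: estimates encoded sizes, it does not call Avro.
final class EncodedSize {

    /** Number of bytes a zig-zag varint encoding of n occupies. */
    static int varintLength(int n) {
        int z = (n << 1) ^ (n >> 31);   // zig-zag: small magnitudes become small unsigned values
        int bytes = 1;
        while ((z & ~0x7F) != 0) {      // each byte carries 7 payload bits
            z >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Bytes for a plain enum field holding the symbol at symbolIndex. */
    static int enumSize(int symbolIndex) {
        return varintLength(symbolIndex);
    }

    /** Bytes for a ["null", enum] field; symbolIndex < 0 stands for null (sketch convention). */
    static int optionalEnumSize(int symbolIndex) {
        if (symbolIndex < 0) {
            return varintLength(0);     // branch 0 is null, and null has no body
        }
        return varintLength(1) + varintLength(symbolIndex);   // branch 1, then the enum index
    }

    public static void main(String[] args) {
        System.out.println("enum, index 0:           " + enumSize(0));
        System.out.println("[null, enum], null:      " + optionalEnumSize(-1));
        System.out.println("[null, enum], index 0:   " + optionalEnumSize(0));
    }
}
```

For small enums the union costs one extra byte per non-null value and nothing extra per null value, which is why the difference shrinks as nulls become more common.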

I suggest doing whatever is most semantically correct for your data -- does it 
make _sense_ to have NULL be an option in the enum?  Would it make code that 
consumes the data simpler?  Having to specify it when writing is less of a 
concern: helper methods or the builder pattern are recommended for creating 
these objects for write, both to enforce that the fields conform and to 
prevent user error.
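
As a minimal sketch of the builder idea, assuming you went with an explicit "no value" symbol in the enum: the names below (Suit.NONE, CardEvent, label) are hypothetical stand-ins, not your generated classes.  The builder defaults the enum field and validates on build(), so calling code cannot forget to initialize it.

```java
// Hypothetical stand-ins; a real project would wrap its generated Avro class instead.
enum Suit { NONE, SPADES, CLUBS, HEARTS, DIAMONDS }

final class CardEvent {
    final String label;
    final Suit suit;

    private CardEvent(String label, Suit suit) {
        this.label = label;
        this.suit = suit;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private String label;
        private Suit suit = Suit.NONE;   // default, so callers never have to set it by hand

        Builder label(String label) {
            this.label = label;
            return this;
        }

        Builder suit(Suit suit) {
            this.suit = suit;
            return this;
        }

        /** Validates the fields before the object can reach a writer. */
        CardEvent build() {
            if (label == null) {
                throw new IllegalStateException("label is required");
            }
            if (suit == null) {
                throw new IllegalStateException("suit must not be null; use Suit.NONE");
            }
            return new CardEvent(label, suit);
        }
    }

    public static void main(String[] args) {
        CardEvent e = CardEvent.builder().label("draw").build();
        System.out.println(e.label + " " + e.suit);
    }
}
```

The same shape works for the [null, enum] variant: drop Suit.NONE, leave the default as null, and remove the null check on suit.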

> 
> I've tried to use union of null and the enum-type, but I've run into an issue 
> with this approach when using the AvroOutputFormat.  The following code 
> summarizes my issue:
> 
>   public void testDataWriteWithSchema() throws IOException {
>     final DataFileWriter<Event> writer =
>       new DataFileWriter<Event>(new SpecificDatumWriter<Event>());
> 
>     writer.create(Event.SCHEMA$, new File("target/datafile-test.avro"));
>     writer.append(getEvent());    
>     writer.close();
>   }
> 
>   public void testDataWriteWithSchemaWithClass() throws IOException {
>     final DataFileWriter<Event> writer =
>       new DataFileWriter<Event>(new SpecificDatumWriter<Event>(Event.class));
> 
>     writer.create(Event.SCHEMA$, new File("target/datafile-test.avro"));
>     writer.append(getEvent());    
>     writer.close();
>   }
> 
> 
This looks like a bug.  Can you file a ticket?  In the first case, the 
constructor is not initializing the SpecificData object, which means that it is 
using GenericData.resolveUnion() instead of SpecificData.resolveUnion(). 

The empty param constructor in SpecificDatumWriter should be

public SpecificDatumWriter() { super(SpecificData.get()); }

rather than the current

public SpecificDatumWriter() { }
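
The effect of an empty constructor like this can be modeled with a few stand-in classes.  This is a simplified illustration and not Avro's source: GenericModel, SpecificModel, BaseWriter, BuggyWriter, FixedWriter and DemoSuit are all invented names, and the string-based type matching is far cruder than what Avro does.  It only shows how an implicit super() call silently selects the generic data model, which does not know how to match an enum to a union branch.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for a generic data model: knows null and strings only.
class GenericModel {
    /** Returns the index of the union branch matching the datum, or throws. */
    int resolveUnion(List<String> branches, Object datum) {
        int i = branches.indexOf(typeName(datum));
        if (i < 0) {
            throw new IllegalArgumentException("Not in union " + branches + ": " + datum);
        }
        return i;
    }

    String typeName(Object datum) {
        if (datum == null) return "null";
        if (datum instanceof String) return "string";
        return "unknown";
    }
}

// Stand-in for a specific data model: additionally recognizes Java enums.
class SpecificModel extends GenericModel {
    @Override
    String typeName(Object datum) {
        if (datum instanceof Enum) return "enum";
        return super.typeName(datum);
    }
}

class BaseWriter {
    private final GenericModel model;

    BaseWriter() { this(new GenericModel()); }            // default: generic model
    BaseWriter(GenericModel model) { this.model = model; }

    int branchFor(List<String> union, Object datum) {
        return model.resolveUnion(union, datum);
    }
}

class BuggyWriter extends BaseWriter {
    BuggyWriter() { }                                      // implicit super(): generic model
}

class FixedWriter extends BaseWriter {
    FixedWriter() { super(new SpecificModel()); }          // passes the specific model along
}

enum DemoSuit { SPADES }

class ConstructorDemo {
    public static void main(String[] args) {
        List<String> union = Arrays.asList("null", "enum");
        System.out.println("fixed writer picks branch " + new FixedWriter().branchFor(union, DemoSuit.SPADES));
        try {
            new BuggyWriter().branchFor(union, DemoSuit.SPADES);
        } catch (IllegalArgumentException e) {
            System.out.println("buggy writer fails: " + e.getMessage());
        }
    }
}
```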


> When I don't pass in the Event.class to SpecificDatumWriter (the first test 
> method), the above test fails with the following exception: 
> 
> Not in union ["null", 
> {"type":"enum","name":"Suit","namespace":"foo","symbols":["SPADES","CLUBS","HEARS","DIAMONDS"]}]:
>  SPADES
> 
> 
> at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:382)
> 
> at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:67)
> 
> at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:100)
> 
> at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
> 
> at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:54)
> 
> at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:245)
> 
> 
> 
> AvroOutputFormat uses the SpecificDatumWriter's default c'tor, so I run into 
> the above exception when using it.  Is there some way around this (other than 
> implementing my own OutputFormat that passes along the class?).

Unfortunately, if it is hidden behind AvroOutputFormat, there isn't much you 
can do other than that or use a patched version of Avro.

> 
> Thanks,
> Joe
> 
