On Dec 20, 2010, at 11:53 AM, Joe Crobak wrote:
> What's the "best" way to represent an optional enum in avro (in terms of
> space efficiency, computational efficiency, and readability)? To be
> consistent with other optional fields, I was planning to use union of null
> and my enum type. The other approach I could see was adding a NULL field to
> the enum -- but then my code would have to initialize the enum field to null
> before a write.
The most space-efficient option would be the enum with a "NULL" symbol, but as
you mention, that is cumbersome for a few reasons. If the field is often null,
the difference in space between the two approaches is minor.
enum: always one byte (unless there are more than 64 symbols).
[null, enum]: one byte when null, two otherwise.
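To make the arithmetic concrete, here is a rough sketch of how those byte counts fall out of Avro's zigzag variable-length int encoding, which is used for both the enum symbol index and the union branch index. The class and method names are made up for illustration and are not part of the Avro API.

```java
// Sketch only: AvroSizeSketch and its methods are hypothetical names, not Avro classes.
final class AvroSizeSketch {

    /** Zigzag-encode a 32-bit int so that small magnitudes become small unsigned values. */
    static int zigZag(int n) {
        return (n << 1) ^ (n >> 31);
    }

    /** Bytes used by the variable-length encoding: 7 payload bits per byte. */
    static int encodedLength(int n) {
        int z = zigZag(n);
        int bytes = 1;
        while ((z & ~0x7F) != 0) {
            z >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Plain enum field: only the symbol index is written. */
    static int enumSize(int symbolIndex) {
        return encodedLength(symbolIndex);
    }

    /** ["null", enum] union: the branch index, then the symbol index if not null. */
    static int optionalEnumSize(Integer symbolIndex) {
        if (symbolIndex == null) {
            return encodedLength(0); // branch 0 is "null"; the null value itself takes no bytes
        }
        return encodedLength(1) + encodedLength(symbolIndex);
    }

    public static void main(String[] args) {
        System.out.println("enum, first symbol:          " + enumSize(0));
        System.out.println("[null, enum], null:          " + optionalEnumSize(null));
        System.out.println("[null, enum], first symbol:  " + optionalEnumSize(0));
    }
}
```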
I suggest doing whatever is most semantically correct for your data -- does it
make _sense_ to have NULL be an option in the enum? Would it make code that
consumes the data simpler? Having to specify it when writing is less of a
concern: helper methods or the builder pattern are recommended for creating
these objects for write, both to enforce that the fields conform and to prevent
user error.
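As a rough illustration of the builder idea, something along these lines would keep the optional enum out of callers' way. Everything here is hypothetical: SketchEvent stands in for your generated Event class, the "name" field is invented for the example, and SketchSuit stands in for your enum.

```java
// Sketch only: all names below are illustrative stand-ins, not generated Avro classes.
enum SketchSuit { SPADES, CLUBS, HEARTS, DIAMONDS }

/** Stand-in for a generated record with one required and one optional field. */
final class SketchEvent {
    String name;      // required (hypothetical field)
    SketchSuit suit;  // optional: null maps to the "null" branch of ["null", enum]
}

/** Builder that validates required fields and leaves the optional enum as null by default. */
final class SketchEventBuilder {
    private String name;
    private SketchSuit suit; // stays null unless the caller sets it

    SketchEventBuilder name(String name) {
        this.name = name;
        return this;
    }

    SketchEventBuilder suit(SketchSuit suit) {
        this.suit = suit;
        return this;
    }

    SketchEvent build() {
        if (name == null) {
            throw new IllegalStateException("name is required");
        }
        SketchEvent event = new SketchEvent();
        event.name = name;
        event.suit = suit;
        return event;
    }
}
```

A caller would then write new SketchEventBuilder().name("deal").suit(SketchSuit.SPADES).build(), or simply omit suit(...) when there is no value.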
>
> I've tried to use union of null and the enum-type, but I've run into an issue
> with this approach when using the AvroOutputFormat. The following code
> summarizes my issue:
>
> public void testDataWriteWithSchema() throws IOException {
>   final DataFileWriter<Event> writer =
>       new DataFileWriter<Event>(new SpecificDatumWriter<Event>());
>
>   writer.create(Event.SCHEMA$, new File("target/datafile-test.avro"));
>   writer.append(getEvent());
>   writer.close();
> }
>
> public void testDataWriteWithSchemaWithClass() throws IOException {
>   final DataFileWriter<Event> writer =
>       new DataFileWriter<Event>(new SpecificDatumWriter<Event>(Event.class));
>
>   writer.create(Event.SCHEMA$, new File("target/datafile-test.avro"));
>   writer.append(getEvent());
>   writer.close();
> }
>
>
This looks like a bug. Can you file a ticket? In the first case, the
constructor does not initialize the SpecificData object, so the writer ends up
using GenericData.resolveUnion() instead of SpecificData.resolveUnion().
The no-argument constructor in SpecificDatumWriter should be
  public SpecificDatumWriter() { super(SpecificData.get()); }
rather than the current
  public SpecificDatumWriter() { }
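To show why the choice of data model matters, here is a toy model of the dispatch. It is not Avro's source: every class name is invented, and the assumption that the generic model does not recognise plain Java enums is inferred from the "Not in union" error quoted in this thread.

```java
import java.util.List;

// Sketch only: a simplified stand-in for the generic/specific split, not Avro code.
enum ToySuit { SPADES, CLUBS }

/** Stand-in for the generic representation of an enum symbol. */
final class ToySymbol {
    final String name;
    ToySymbol(String name) { this.name = name; }
}

/** Generic model: assumed to recognise only its own symbol wrapper as an enum. */
class ToyGenericModel {
    boolean isEnum(Object datum) {
        return datum instanceof ToySymbol;
    }

    /** Returns the index of the first union branch that matches the datum. */
    int resolveUnion(List<String> branches, Object datum) {
        for (int i = 0; i < branches.size(); i++) {
            String branch = branches.get(i);
            if (branch.equals("null") && datum == null) return i;
            if (branch.equals("enum") && isEnum(datum)) return i;
        }
        throw new IllegalArgumentException("Not in union " + branches + ": " + datum);
    }
}

/** Specific model: additionally accepts Java enum constants. */
class ToySpecificModel extends ToyGenericModel {
    @Override
    boolean isEnum(Object datum) {
        return datum instanceof Enum || super.isEnum(datum);
    }
}

/** Writer whose no-argument constructor falls back to the generic model. */
class ToyWriter {
    private final ToyGenericModel model;

    ToyWriter() { this(new ToyGenericModel()); }   // mirrors the reported behaviour
    ToyWriter(ToyGenericModel model) { this.model = model; }

    int branchFor(List<String> union, Object datum) {
        return model.resolveUnion(union, datum);
    }
}
```

With a union of "null" and "enum", new ToyWriter().branchFor(union, ToySuit.SPADES) throws, while new ToyWriter(new ToySpecificModel()) resolves the same value to the enum branch. The proposed constructor change has the same effect: it makes the default path use the specific model.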
> When I don't pass in the Event.class to SpecificDatumWriter (the first test
> method), the above test fails with the following exception:
>
> Not in union ["null",
> {"type":"enum","name":"Suit","namespace":"foo","symbols":["SPADES","CLUBS","HEARS","DIAMONDS"]}]:
> SPADES
>
>
> at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:382)
> at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:67)
> at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:100)
> at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
> at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:54)
> at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:245)
>
> AvroOutputFormat uses the SpecificDatumWriter's default c'tor, so I run into
> the above exception when using it. Is there some way around this (other than
> implementing my own OutputFormat that passes along the class)?
Unfortunately, since the writer is created inside AvroOutputFormat, there isn't
much more to do than that (your own OutputFormat that passes the class along)
or use a patched version of Avro.
>
> Thanks,
> Joe
>