Hi,

This is my first post to this list. I'm writing a binding API for our database product that lets users easily store Avro binary data in the database and use any of the built-in Avro object representations (Generic, Specific, etc.), as well as one we've added (JsonNode) by subclassing the Generic classes.

In our binding API, we don't support object reuse, so the Utf8 class has no real benefit and String would be more convenient for our users. I see that String can be used (rather than the Utf8 default) by two different mechanisms:

(1) I can override GenericDatumReader.readString, or
(2) I can set the "avro.java.string" property to "String" for each string field in the schema.

I would like to do (1) because it is cleaner (the schema isn't cluttered with metadata that is the same for every string field) and because I don't think information about the object representation logically belongs in the schema. Two users of the same schema may use different object representations, one with String and the other with Utf8.

So my question is: why is the string type a property in the schema, i.e., why does option (2) exist in Avro? Is there something I'm missing about its benefit?

Also, if I use option (1), is this likely to cause compatibility problems with other components that process Avro data and Avro schemas, such as Hadoop? Our users may create a schema, store the data for that schema in our database, and then later use the same schema to process this data in Hadoop. Hadoop is just one example; one of the reasons we chose Avro is its widespread use in many components. Does there typically need to be agreement about the string type among the different entities that process data for a shared schema?

Thanks in advance for any advice.

--mark
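
P.S. To make the "clutter" concern with option (2) concrete, here is a rough sketch in plain Java (no Avro dependency) that builds the JSON text of a schema where every string field carries the property. Only the "avro.java.string" property name and its "String" value come from Avro; the class, the helper recordSchema, and the record and field names are made up for illustration.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only; not part of Avro.
class StringPropSchemaSketch {

    // Type declaration for a string field that carries the Java string hint.
    static final String STRING_TYPE_WITH_HINT =
        "{\"type\": \"string\", \"avro.java.string\": \"String\"}";

    // Builds the JSON text of a record schema whose fields are all strings,
    // repeating the hint on every field, as option (2) would require.
    static String recordSchema(String recordName, String... stringFieldNames) {
        String fields = Arrays.stream(stringFieldNames)
            .map(name -> "{\"name\": \"" + name + "\", \"type\": "
                + STRING_TYPE_WITH_HINT + "}")
            .collect(Collectors.joining(", "));
        return "{\"type\": \"record\", \"name\": \"" + recordName
            + "\", \"fields\": [" + fields + "]}";
    }

    public static void main(String[] args) {
        // Prints the schema text; note the property appears once per field.
        System.out.println(recordSchema("Person", "first", "last"));
    }
}
```

With option (1), as I understand it, the same schema would contain plain "string" types and the choice of String over Utf8 would live only in my reader subclass.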
