On 05/23/2012 09:10 PM, Mark Hayes wrote:
So my question is:  Why is the string type a property in the schema,
i.e., why does option (2) exist in Avro?  Is there something I'm missing
about its benefit?

It's for backward compatibility. Strings in the specific and generic representations were originally always read as Utf8, so many existing applications expect strings to be Utf8. Rather than break all of these applications, we let folks opt in to the change. For applications that use the specific representation (those that generate code) and wish to change from Utf8 to String, this requires only adding a single parameter to their Maven configuration, so it's not very invasive. The runtime must know which representation is desired for strings, and the Schema is the convenient runtime structure to annotate.
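
To make the annotation concrete, here is a minimal sketch of a schema whose string field carries the property. The record name `Example` and the field name `name` are hypothetical, chosen only for illustration; `avro.java.string` is, to my knowledge, the property name the Java implementation looks for on a string schema.

```json
{
  "type": "record",
  "name": "Example",
  "fields": [
    {
      "name": "name",
      "type": {"type": "string", "avro.java.string": "String"}
    }
  ]
}
```

A string schema without the property keeps the original Utf8 behavior, which is what preserves compatibility for existing applications.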

Note that we'd prefer not to make it a property of the Encoder/Decoder or DatumWriter/DatumReader instead, since we permit folks to intermix reflect, specific and generic objects in a tree. For example, one may have a reflected datum in which some fields are defined by generated specific classes and other fields correspond to no class on the classpath, so the generic representation is used for those. This flexibility permits classes like org.apache.avro.mapred.Pair<X,Y>, which can contain reflect, specific or generic instances.

Also, if I use option (1), is this likely to cause compatibility
problems with other components that process Avro data and Avro schemas,
such as Hadoop?

No, I don't think so. Reflect, specific and generic inherit from one another, sharing many parts of their implementation, so changes to these must keep the others in mind. But if you've defined a new DatumReader implementation that's only used to read your data, that should not affect any other applications.

Our users may create a schema and store the data for
that schema in our database, and then later use the same schema for
processing this data in Hadoop.  Hadoop is just one example, since one
of the reasons we chose Avro is because of its widespread use in many
components.  Does there typically need to be agreement about the string
type among different entities that process data for a shared schema?

Not really. If you're reading data that corresponds to a generated specific class, then the reader will always use the string representation that class expects, since generated code contains its schema. If you use reflection to read data into instances of a non-generated class, then it will generally read strings as java.lang.String. The generic representation will use Utf8 for unannotated string schemas. Your map and reduce functions will need to be written accordingly.
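
One way to write functions "accordingly" is to treat every string value as a CharSequence and convert it before comparing or using it as a key, since both java.lang.String and Utf8 implement CharSequence but a Utf8 is never equal to a String. The sketch below is only an illustration under that assumption: the class `StringFields` and its methods are hypothetical helpers, not part of Avro, and a StringBuilder stands in for Utf8 so the example runs without Avro on the classpath.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Avro: normalizes string values so that
// the same logic works whichever representation the reader produced.
final class StringFields {
    private StringFields() {}

    // Returns the value as a java.lang.String, or null if it is not a
    // character sequence at all.
    static String asString(Object value) {
        return (value instanceof CharSequence) ? value.toString() : null;
    }

    // Counts values keyed by java.lang.String, regardless of whether each
    // input arrived as a String or as some other CharSequence.
    static Map<String, Integer> countWords(Iterable<?> values) {
        Map<String, Integer> counts = new HashMap<>();
        for (Object v : values) {
            String key = asString(v);
            if (key != null) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // StringBuilder is a stand-in for a non-String CharSequence such as Utf8.
        Object fromGeneric = new StringBuilder("avro");
        Object fromReflect = "avro";

        // Different representations of the same text are not equal as objects.
        System.out.println(fromGeneric.equals(fromReflect)); // false

        // After normalization both values land on the same key.
        System.out.println(countWords(Arrays.asList(fromGeneric, fromReflect))); // {avro=2}
    }
}
```

Converting with toString() costs an allocation per value, so code on a hot path may prefer to accept CharSequence directly and convert only where a String key is actually required.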

I hope this helps!

Doug
