Re: Why is the String type a Schema property?

Doug Cutting Thu, 24 May 2012 09:08:46 -0700

On 05/23/2012 09:10 PM, Mark Hayes wrote:

So my question is:  Why is the string type a property in the schema,
i.e., why does option (2) exist in Avro?  Is there something I'm missing
about its benefit?

It's for back compatibility. Strings in specific and generic representations were originally always read as Utf8, so many existing applications expect strings to be Utf8. Rather than breaking all of these applications we instead permitted folks to opt in to this change. For applications that use the specific representation (those that generate code) and wish to change from Utf8 to String it requires only adding a single parameter to their Maven configuration, so it's not very invasive. The runtime must know which representation is desired for strings, and the Schema is the convenient runtime structure to annotate.

Note that we'd prefer not to instead make it a property of the Encoder/Decoder or DatumWriter/DatumReader since we permit folks to intermix reflect, specific and generic objects in a tree. For example, one may have a reflected datum that has some fields which are defined by generated specific classes and other fields which correspond to no class on the classpath so the generic representation is used. This flexibility permits classes like org.apache.avro.mapred.Pair<X,Y>, which can contain reflect, specific or generic instances.

Also, if I use option (1), is this likely to cause compatibility
problems with other components that process Avro data and Avro schemas,
such as Hadoop?

No, I don't think so. If you use your own DatumReader implementation to read your data then that should not affect anyone else. Reflect, specific and generic inherit from one another, sharing many parts of their implementation, so changes to these must keep the others in mind, but if you've defined a new DatumReader that's only used to read your data that should not affect any other applications.

Our users may create a schema and store the data for
that schema in our database, and then later use the same schema for
processing this data in Hadoop.  Hadoop is just one example, since one
of the reasons we chose Avro is because of its widespread use in many
components.  Does there typically need to be agreement about the string
type among different entities that process data for a shared schema?

Not really. If you're reading things that correspond to a generated specific class then it will always use the representation it expects, since generated code contains its schema. If you use reflection to read things into instances of a non-generated class then it will generally read strings as java.lang.String. The generic representation will use Utf8 for unannotated string schemas. Your map and reduce functions will need to be written accordingly.
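
To make "written accordingly" concrete, here is a minimal sketch of one way to stay independent of the string representation: accept CharSequence and convert with toString() before comparing or using a value as a map key. Avro's Utf8 implements CharSequence, but a String never equals() a Utf8, so comparing the two directly fails. The names StringTypeDemo, Utf8Like and countNames are made up for illustration; Utf8Like is only a stand-in so the example runs without Avro on the classpath, and it is not the real org.apache.avro.util.Utf8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StringTypeDemo {

    // Hypothetical stand-in for a non-String CharSequence such as Avro's Utf8.
    // It is NOT the real org.apache.avro.util.Utf8 class.
    static final class Utf8Like implements CharSequence {
        private final byte[] bytes;

        Utf8Like(String s) {
            this.bytes = s.getBytes(StandardCharsets.UTF_8);
        }

        private String decoded() {
            return new String(bytes, StandardCharsets.UTF_8);
        }

        @Override public int length() { return decoded().length(); }
        @Override public char charAt(int index) { return decoded().charAt(index); }
        @Override public CharSequence subSequence(int start, int end) {
            return decoded().subSequence(start, end);
        }
        @Override public String toString() { return decoded(); }
    }

    // Counts occurrences of each name. Every value is normalized to
    // java.lang.String first, so the result does not depend on which
    // string representation the reader produced.
    static Map<String, Integer> countNames(List<? extends CharSequence> names) {
        Map<String, Integer> counts = new HashMap<>();
        for (CharSequence name : names) {
            counts.merge(name.toString(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<CharSequence> mixed = Arrays.<CharSequence>asList(
                "alice", new Utf8Like("alice"), new Utf8Like("bob"));
        // Direct comparison across representations is false.
        System.out.println("alice".equals(mixed.get(1)));
        // After normalization both "alice" values land under the same key.
        System.out.println(countNames(mixed));
    }
}
```

The toString() call costs a decode per value, so code that only passes strings through may prefer to keep them as CharSequence and convert only where equality or hashing matters.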


I hope this helps!

Doug
