On Dec 2, 2010, at 7:30 AM, David Jeske wrote:

> I like the inclusion of sort order in Avro, to enable different machines to sort and exchange data. I have a few suggestions to clarify the documentation. Please correct any incorrect assumptions I've made...
> It seems that sorts are not stable across schema versions. I think I understand why this makes sense within the schema philosophy, yet I think the documentation could clear up a couple of the subtleties a bit more. For example, it says "data items may only be compared if they have identical schemas". If I supply a source schema that Avro can map into my target schema, I would think it could load and compare things in my target schema. Is this correct? It might be clarified.

There is some need for clarification. As I understand it, things are sorted in the order of the reader's schema, but I may be wrong. If the schema changes, the sort order can change; there is no getting around that. Usually, as a schema evolves, some things that were formerly different become equal, and some things that were equal become different. Typically, the new schema's definition of order and equivalence is all that matters, so a sort will be consistent, but unstable, with respect to the new schema. But some schema changes will break that (such as changing a field from ascending to descending order, or changing the order in which fields are compared).

> Also, the comment "this permits data written by one system to be efficiently sorted by another system" could call out that data items sorted in one schema may not be in the proper order if, during read, they are mapped to a new version of the schema. In fact, it might be useful for Avro to be able to tell me, when it does the source->target schema mapping, whether both schemas sort in the same order (if it doesn't already).

It would be useful to report whether the reader/writer schema resolution altered the sort order or not. I don't think we do this. The answer to that question is not as simple as yes or no, however.
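To make the ascending/descending case concrete, here is a minimal sketch in Python. It is not Avro's API: the schema lists and the `compare` and `is_sorted` helpers are hypothetical stand-ins that only mimic field-by-field record comparison with a per-field order. It shows data sorted under an old schema failing to be sorted under a new one.

```python
from functools import cmp_to_key

# Hypothetical stand-in for a record schema: an ordered list of
# (field name, order) pairs; order is "ascending", "descending" or "ignore".
OLD_SCHEMA = [("last", "ascending"), ("age", "ascending")]
NEW_SCHEMA = [("last", "ascending"), ("age", "descending")]


def compare(a, b, schema):
    """Compare two dict records field by field, in the schema's field order."""
    for name, order in schema:
        if order == "ignore":
            continue  # ignored fields never affect the result
        if a[name] != b[name]:
            result = -1 if a[name] < b[name] else 1
            return -result if order == "descending" else result
    return 0


def is_sorted(records, schema):
    """True if every adjacent pair is in non-decreasing order under schema."""
    return all(compare(x, y, schema) <= 0 for x, y in zip(records, records[1:]))


records = [
    {"last": "Smith", "age": 40},
    {"last": "Jones", "age": 25},
    {"last": "Smith", "age": 31},
]

# Sort with the old schema, then check the result against both schemas.
sorted_old = sorted(records, key=cmp_to_key(lambda a, b: compare(a, b, OLD_SCHEMA)))
print(is_sorted(sorted_old, OLD_SCHEMA))  # True
print(is_sorted(sorted_old, NEW_SCHEMA))  # False: the two Smith records are now reversed
```

A check like `is_sorted` against the reader's schema is one way a consumer could detect the problem after the fact; it does not distinguish the "consistent but unstable" case from the fully stable one.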
The sort order after migrating from an old schema to a new one may change completely, or it may remain consistent but be unstable from the point of view of the new schema, or it may be both consistent and stable with respect to sorts using the prior schema.

> Lastly, it says "Note also that Avro binary-encoded data can be efficiently ordered without deserializing it to objects." What does this mean exactly? This might be misinterpreted as saying one can lexicographically sort the binary encoding without asking Avro to deserialize it, and it will be in the proper order. However, this seems obviously not true given the number formats. Perhaps it would be clearer to say "Avro can efficiently make sort-comparisons on binary-encoded data without allocating deserialization objects." Did I properly understand those sort-related subtleties?

Yes, perhaps we should say "Avro can efficiently make sort-comparisons on binary data without full deserialization" or something similar.
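For illustration, the distinction can be sketched in Python as follows. The sketch assumes records made only of long fields, encoded as zig-zag variable-length integers as the Avro specification describes for int and long; `encode_long`, `read_long` and `compare_binary` are hypothetical helper names, not Avro API. Sorting the raw bytes lexicographically gives the wrong numeric order, while a comparator that decodes one number at a time, without building any record objects, gives the right one.

```python
from functools import cmp_to_key


def encode_long(n):
    """Encode a signed 64-bit integer as zig-zag, variable-length bytes."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes get small codes
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(buf, pos):
    """Decode one zig-zag varint at pos; return (value, next position)."""
    z = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def compare_binary(a, b, field_count):
    """Compare two encoded records of `field_count` longs, field by field,
    decoding only primitive values and stopping at the first difference."""
    pa = pb = 0
    for _ in range(field_count):
        va, pa = read_long(a, pa)
        vb, pb = read_long(b, pb)
        if va != vb:
            return -1 if va < vb else 1
    return 0


encoded = [encode_long(v) for v in (64, 1, -1, 0)]

# Lexicographic sort of the raw bytes: not numeric order.
by_bytes = [read_long(e, 0)[0] for e in sorted(encoded)]
print(by_bytes)  # [0, -1, 1, 64]

# Sort using the streaming comparator: numeric order.
by_compare = [
    read_long(e, 0)[0]
    for e in sorted(encoded, key=cmp_to_key(lambda a, b: compare_binary(a, b, 1)))
]
print(by_compare)  # [-1, 0, 1, 64]
```

The comparator still decodes each number it looks at, which is why "without full deserialization" is a more accurate description than "without deserializing".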
