On Dec 2, 2010, at 7:30 AM, David Jeske wrote:

I like the inclusion of sort-order in avro, to enable different machines to 
sort and exchange. I have a few suggestions to clarify the documentation. 
Please correct any assumptions I've made that are incorrect...

It seems that sorts are not stable across schema versions. I think I understand 
why this makes sense inside the schema philosophy, yet I think the 
documentation could clear up a couple of the subtleties a bit more. For example, 
it says "data items may only be compared if they have identical schemas". If I 
supply a source schema which avro can map into my target schema, I would think 
it could load and compare things in my target schema. Is this correct? It might 
be clarified.

There is some need for clarification.  As I understand it, things are sorted in 
the order of the reader's schema, but I may be wrong.  If the schema changes, 
the sort order can change.  There is no getting around that.  Usually as a 
schema evolves some things that were formerly different become equal, and some 
things that were equal become different.  Typically, the new schema's 
definition of order and equivalence is all that matters, so a sort will be 
consistent, but unstable, with respect to the new schema.  But some schema 
changes will break that (such as changing a field from ascending to descending 
order, or changing the order in which fields are compared).
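
To make that last case concrete, here is a minimal Python sketch. It does not
use Avro's API: the `compare` helper and the toy schema lists of
(field, order) pairs are assumptions made up for this example. Records are
compared field by field in schema order, honouring each field's order
attribute, and flipping one field from ascending to descending reverses the
result of the sort.

```python
from functools import cmp_to_key

def compare(a, b, schema):
    """Compare two dict records field by field, in the order the schema lists them.

    `schema` is a toy stand-in for an Avro record schema (an assumption for
    this sketch): a list of (field_name, order) pairs, where order is
    "ascending", "descending" or "ignore".
    """
    for name, order in schema:
        if order == "ignore":
            continue  # ignored fields never influence the comparison
        if a[name] != b[name]:
            result = -1 if a[name] < b[name] else 1
            return result if order == "ascending" else -result
    return 0  # equal on every field that takes part in the comparison

def sort_records(records, schema):
    return sorted(records, key=cmp_to_key(lambda a, b: compare(a, b, schema)))

records = [{"id": 2, "name": "b"}, {"id": 1, "name": "c"}, {"id": 3, "name": "a"}]
old_schema = [("id", "ascending"), ("name", "ascending")]
new_schema = [("id", "descending"), ("name", "ascending")]

old_ids = [r["id"] for r in sort_records(records, old_schema)]
new_ids = [r["id"] for r in sort_records(records, new_schema)]
```

Data written in `old_schema` order is exactly reversed when read and sorted
under `new_schema`, so nothing about the old ordering can be reused.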


Also, the comment "this permits data written by one system to be efficiently 
sorted by another system" could call out that data items sorted in one schema 
may not be in the proper order if during read they are mapped to a new version 
of the schema. In fact, it might be useful for Avro to be able to tell me when 
it does the source->target schema mapping, whether both schemas sort in the 
same order (if it doesn't already).

It would be useful to report whether the reader/writer schema resolution 
altered the sort order.  I don't think we do this.  The answer to that 
question is not a simple yes or no, however.  The sort order when 
migrated from an old schema to a new one may change completely, or it may 
remain consistent but be unstable from the POV of the new schema, or be both 
consistent and stable with respect to sorts using the prior schema.
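
A rough sketch of what such a check could look like, in Python. This is not
something Avro provides as far as I know; `sort_key` and `classify` are
hypothetical helpers, and the schemas are again toy lists of (field, order)
pairs for flat records. The check is deliberately conservative: it only
recognises the case where the new schema compares a leading subset of the
fields the old one did, and treats everything else as a possible full
reordering.

```python
def sort_key(schema):
    """Fields that take part in comparison, in order, as (name, order) pairs."""
    return [(name, order) for name, order in schema if order != "ignore"]

def classify(old_schema, new_schema):
    """Classify how data sorted under old_schema relates to new_schema's order.

    Hypothetical helper, flat records only. Returns one of three labels.
    """
    old_key, new_key = sort_key(old_schema), sort_key(new_schema)
    if old_key == new_key:
        # Same fields, same directions, same sequence.
        return "consistent and stable"
    if old_key[:len(new_key)] == new_key:
        # The new schema compares a prefix of the old key. Records that
        # differed only in the dropped fields are now equal, so the old
        # order is still a valid order, but ties are arbitrary.
        return "consistent but unstable"
    # Anything else (changed direction, reordered or added fields).
    return "may change completely"

old = [("id", "ascending"), ("name", "ascending")]
same = [("id", "ascending"), ("name", "ascending"), ("note", "ignore")]
prefix = [("id", "ascending"), ("name", "ignore")]
flipped = [("id", "descending"), ("name", "ascending")]
```

Here `classify(old, same)`, `classify(old, prefix)` and
`classify(old, flipped)` land in the three different cases described above.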



Lastly, it says "Note also that Avro binary-encoded data can be efficiently 
ordered without deserializing it to objects." What does this mean exactly?  
This might be misinterpreted as saying one can lexicographically sort the 
binary encoding without asking Avro to deserialize it, and it'll be in the 
proper order. However, this is clearly not true given the number formats. Perhaps 
it would be clearer to say "Avro can efficiently make sort-comparisons on 
binary-encoded data without allocating deserialization objects."

Did I properly understand those sort-related subtleties?

Yes, perhaps we should say "Avro can efficiently make sort-comparisons on 
binary data without full deserialization" or something similar.
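
For what it's worth, the number-format point is easy to demonstrate. Avro
writes int and long values with zig-zag, variable-length coding, so the raw
bytes do not sort in numeric order. The Python sketch below is a standalone
illustration rather than Avro's own code; `encode_long`, `decode_long` and
`compare_encoded` are names invented for this example. It shows a pair of
values whose byte order disagrees with their numeric order, and a comparison
that reads only the encoded number instead of building a record object.

```python
def encode_long(n):
    """Encode an integer as zig-zag followed by a base-128 varint."""
    z = (n << 1) ^ (n >> 63)           # zig-zag: 0, -1, 1, -2 -> 0, 1, 2, 3
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits first, high bit = more follows
        z >>= 7
    out.append(z)
    return bytes(out)

def decode_long(buf, pos=0):
    """Decode one zig-zag varint starting at pos; return (value, next_pos)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos

def compare_encoded(a, b):
    """Compare two encoded longs numerically, reading only the bytes needed."""
    x, _ = decode_long(a)
    y, _ = decode_long(b)
    return (x > y) - (x < y)

minus_two = encode_long(-2)   # b"\x03"
one = encode_long(1)          # b"\x02"
bytes_say_greater = minus_two > one            # raw byte order: -2 sorts after 1
numeric_result = compare_encoded(minus_two, one)  # numeric order: -2 sorts before 1
```

So a plain lexicographic sort of the bytes would put -2 after 1, while a
comparison that decodes just the number gets the order right.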
