Hi Ted,

Thanks for pointing that out; I hadn't read the Avro sort order spec recently. It looks like there's some overlap at a high level (providing byte array representations that can be sorted without deserialization). After glancing at BinaryEncoder/Decoder in Avro, it looks to me like the differences are:

- Avro avoids deserialization when sorting its data, but it uses custom byte array comparators for different types. All of our encodings, including struct/record types, actually sort if you just compare the raw bytes using Bytes.compareTo. You can directly use the serialized byte values from this project in HBase without requiring HBase to implement its own custom comparator functions (it appears that for Avro key support, you'd need to parse Avro's schemas and have a custom comparator defined for each data type, and this would be used in HBase's sorting functions). You can drop Orderly's row keys into HBase without modifying the code base at all.
- Per the above point, the actual serialization algorithms we use are quite different as we can't rely on custom comparator functions -- just Bytes.compareTo comparing raw bytes. The serializations end up with similar goals (e.g., variable-length zig-zag integers) but the implementation and algorithms are very different.
- Very slightly more compact encodings in certain situations for some types -- our Strings don't require an integer length, they use a terminator byte, and in ascending sort don't even require the terminator byte. Our variable-length integers have some very minor length differences (by a bit or two) in some larger variable-length long serializations. Probably not enough to really matter in all honesty.

Avro is very cool, and for a general serialization and RPC platform it's definitely fantastic. Orderly is more of a focused solution for producing byte arrays for use in projects like HBase, without requiring those projects to integrate a serialization system.

If you have any more questions, or if I missed some features that I should be contrasting, let me know.

Best regards,
Mike

On Wed, Apr 13, 2011 at 8:07 PM, Ted Dunning <[email protected]> wrote:
> Michael,
>
> Interesting contribution to the open source community. Sounds like nice
> work.
>
> Can you say how this relates to Avro with regard to collating of binary
> data?
>
> See, for instance, here:
> http://avro.apache.org/docs/current/spec.html#order
>
>
> On Wed, Apr 13, 2011 at 5:55 PM, Michael Dalton <[email protected]> wrote:
>
>> Hi all,
>>
>> I'm with a startup, GotoMetrics, doing things with Hadoop and I've gotten
>> permission to open source Orderly -- our row key schema system for use in
>> projects like HBase. Orderly allows you to serialize common data types
>> (long, double, bigdecimal, etc) or structs/records of these types to byte
>> arrays, and ensures that the byte arrays sort in the same natural order as
>> the data type. You may then use the byte arrays as keys in HBase (or any
>> sorted, byte-typed key-value store).
>>
>> I'd really appreciate feedback about what parts are useful (or not useful),
>> and if this would be something that would be appropriate to submit as a
>> contrib to HBase itself (or if people would prefer me to submit derivative
>> work to add composite row keys to Hive/Pig/etc).
>>
>> Here are the interesting features:
>>
>> - All types are serialized to a byte array that sorts in the natural order
>>   of the underlying key for all key values (e.g., an Integer row key will
>>   sort correctly for negative/positive values, a double will sort correctly
>>   for negative/positive/zero/infinity/negative infinity/subnormals/etc -
>>   any valid value)
>> - Both ascending and descending sort order are supported for all types
>> - Designed for space efficiency - tricks like using the end of a byte
>>   array instead of a terminator byte, variable-length types whenever
>>   possible, etc are all employed to minimize serialization length
>> - Support for row key prefixes/suffixes to combine with your own custom
>>   encodings
>> - Variable-length integers (similar in theory to Zig-Zag encoding) are
>>   supported, and their byte serialization preserves sort ordering
>> - BigDecimal support (like all other types, with sort ordering-preserving
>>   byte serialization). To the best of my knowledge the first byte-sortable
>>   BigDecimal serialization.
>> - Float/Double
>> - UTF-8 strings (with support for empty string, NULL, etc)
>> - Almost all types encode NULL, and do so without using additional space
>>   (e.g., by using transformation on invalid UTF-8 encodings for Strings,
>>   NaNs removed during NaN canonicalization for doubles, etc). Null compares
>>   less than any non-null value
>> - Support for struct (composite) row keys with an arbitrary number of
>>   fields. Each field may have its own sort order. Structs are sorted by
>>   field value.
>>
>> I have the code up on github at http://github.com/mwdalton/orderly. There
>> are javadocs for all the row key types explaining their serialization
>> format and performance characteristics (start with the RowKey and
>> StructRowKey docs), as well as example code in src/example.
>>
>> Please let me know if you have any questions or if there's anything that
>> would be useful to add/change. Thanks!
>>
>> Best regards,
>>
>> Mike
>>
>
>
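
To make the raw-byte comparison point in the reply concrete, here is a minimal Java sketch of a fixed-width, order-preserving encoding for long and double. It is not Orderly's actual serialization (which, per the thread, is variable-length and more compact); it only illustrates how a well-known bit-flipping trick lets plain unsigned lexicographic comparison, the kind of comparison Bytes.compareTo performs, reproduce the natural order. The class and method names (OrderedKeySketch, encodeLong, encodeDouble, descending, compareRaw) are hypothetical and do not come from Orderly or HBase.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only; names are hypothetical and this is NOT Orderly's format.
final class OrderedKeySketch {

    // 8-byte big-endian encoding with the sign bit flipped, so negative values
    // sort before non-negative ones when bytes are compared as unsigned.
    static byte[] encodeLong(long v) {
        return ByteBuffer.allocate(8).putLong(v ^ Long.MIN_VALUE).array();
    }

    // IEEE 754 bits: flip every bit of negative values (reversing their order)
    // and only the sign bit of non-negative values.
    static byte[] encodeDouble(double d) {
        long bits = Double.doubleToLongBits(d); // collapses NaNs to one pattern
        bits ^= (bits >> 63) | Long.MIN_VALUE;
        return ByteBuffer.allocate(8).putLong(bits).array();
    }

    // Descending order for fixed-width keys: invert every byte.
    static byte[] descending(byte[] ascending) {
        byte[] out = new byte[ascending.length];
        for (int i = 0; i < ascending.length; i++) {
            out[i] = (byte) ~ascending[i];
        }
        return out;
    }

    // Stand-in for a raw byte comparator: unsigned, lexicographic,
    // shorter array first when one is a prefix of the other.
    static int compareRaw(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Prints whether the encoded keys compare in the same order as the values.
        System.out.println(compareRaw(encodeLong(-5L), encodeLong(3L)) < 0);
        System.out.println(compareRaw(encodeDouble(-1.5), encodeDouble(0.25)) < 0);
    }
}
```

Because the ordering lives entirely in the bytes, a store that only knows how to compare byte arrays needs no type-specific comparator. Inverting the bytes gives a descending order for fixed-width keys like these; variable-length encodings need more care.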
