Michael,

Interesting contribution to the open source community. Sounds like nice work.
Can you say how this relates to Avro with regard to collating of binary data? See, for instance, here: http://avro.apache.org/docs/current/spec.html#order

On Wed, Apr 13, 2011 at 5:55 PM, Michael Dalton <[email protected]> wrote:

> Hi all,
>
> I'm with a startup, GotoMetrics, doing things with Hadoop and I've gotten
> permission to open source Orderly -- our row key schema system for use in
> projects like HBase. Orderly allows you to serialize common data types
> (long, double, bigdecimal, etc) or structs/records of these types to byte
> arrays, and ensures that the byte arrays sort in the same natural order as
> the data type. You may then use the byte arrays as keys in HBase (or any
> sorted, byte-typed key-value store).
>
> I'd really appreciate feedback about what parts are useful (or not useful),
> and if this would be something that would be appropriate to submit as a
> contrib to HBase itself (or if people would prefer me to submit derivative
> work to add composite row keys to Hive/Pig/etc).
>
> Here are the interesting features:
>
> - All types are serialized to a byte array that sorts in the natural order
>   of the underlying key for all key values (e.g., an Integer row key will
>   sort correctly for negative/positive values, a double will sort correctly
>   for negative/positive/zero/infinity/negative infinity/subnormals/etc -
>   any valid value)
> - Both ascending and descending sort order are supported for all types
> - Designed for space efficiency - tricks like using the end of a byte
>   array instead of a terminator byte, variable-length types whenever
>   possible, etc are all employed to minimize serialization length
> - Support for row key prefixes/suffixes to combine with your own custom
>   encodings
> - Variable-length integers (similar in theory to Zig-Zag encoding) are
>   supported, and their byte serialization preserves sort ordering
> - BigDecimal support (like all other types, with sort ordering-preserving
>   byte serialization). To the best of my knowledge the first byte-sortable
>   BigDecimal serialization.
> - Float/Double
> - UTF-8 strings (with support for empty string, NULL, etc)
> - Almost all types encode NULL, and do so without using additional space
>   (e.g., by using transformation on invalid UTF-8 encodings for Strings,
>   NaNs removed during NaN canonicalization for doubles, etc). Null compares
>   less than any non-null value
> - Support for struct (composite) row keys with an arbitrary number of
>   fields. Each field may have its own sort order. Structs are sorted by
>   field value.
>
> I have the code up on github at http://github.com/mwdalton/orderly. There
> are javadocs for all the row key types explaining their serialization
> format and performance characteristics (start with the RowKey and
> StructRowKey docs), as well as example code in src/example.
>
> Please let me know if you have any questions or if there's anything that
> would be useful to add/change. Thanks!
>
> Best regards,
>
> Mike
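
For anyone following the thread, the core idea in the feature list above (byte arrays that sort in the same order as the values they encode) can be illustrated with a minimal Python sketch. This is not Orderly's actual format, which is described in its javadocs and uses variable-length encodings; it is a common fixed-width technique, shown only to make the idea concrete. The function names are hypothetical, and NaN and NULL handling are deliberately left out.

```python
import struct

SIGN_BIT = 1 << 63
MASK_64 = (1 << 64) - 1


def encode_long(value: int) -> bytes:
    """Encode a signed 64-bit integer so that unsigned byte order matches numeric order."""
    # Take the two's-complement bit pattern and flip the sign bit, so that
    # negative values get a leading 0 bit and non-negative values a leading 1 bit.
    return struct.pack(">Q", (value & MASK_64) ^ SIGN_BIT)


def encode_double(value: float) -> bytes:
    """Encode an IEEE 754 double so that unsigned byte order matches numeric order.

    NaN is not handled in this sketch.
    """
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    if bits & SIGN_BIT:
        # Negative: invert every bit so larger magnitudes sort earlier.
        bits ^= MASK_64
    else:
        # Non-negative: set the sign bit so these sort after all negatives.
        bits ^= SIGN_BIT
    return struct.pack(">Q", bits)


def descending(encoded: bytes) -> bytes:
    """Invert every byte of a fixed-width encoding to reverse its sort order."""
    return bytes(b ^ 0xFF for b in encoded)


def encode_struct(a: int, b: float) -> bytes:
    """Hypothetical composite key: field a ascending, then field b descending."""
    # Concatenating fixed-width fields sorts by the first field, then the second.
    return encode_long(a) + descending(encode_double(b))
```

Sorting a list of values with `key=encode_long` or `key=encode_double` should give the same order as sorting the values directly, since Python compares `bytes` objects lexicographically by unsigned byte value, which is also how a byte-sorted store such as HBase orders row keys. Byte inversion for descending order works here only because every field has a fixed width; variable-length fields need more care.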
