Hi all, I'm with a startup, GotoMetrics, doing things with Hadoop, and I've gotten permission to open source Orderly -- our row key schema system for use in projects like HBase. Orderly allows you to serialize common data types (long, double, BigDecimal, etc.) or structs/records of these types to byte arrays, and ensures that the byte arrays sort in the same natural order as the data type. You may then use the byte arrays as keys in HBase (or any sorted, byte-typed key-value store).
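To make "byte arrays sort in the same natural order as the data type" concrete, here is a rough sketch of the general technique for fixed-width longs and doubles. It illustrates the idea only and is not Orderly's actual serialization format (the javadocs describe that); the function names `encode_long` and `encode_double` are made up for this example.

```python
import struct


def encode_long(value: int) -> bytes:
    """Encode a signed 64-bit integer as 8 bytes that sort like the integer.

    Adding 2**63 (equivalent to flipping the sign bit) maps the signed range
    onto the unsigned range monotonically; big-endian packing then makes
    lexicographic byte order match numeric order.
    """
    return struct.pack(">Q", value + (1 << 63))


def encode_double(value: float) -> bytes:
    """Encode an IEEE 754 double as 8 bytes that sort like the double.

    Non-negative values get their sign bit set; negative values have every
    bit inverted so that larger magnitudes sort first. NaN is not handled
    in this sketch.
    """
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    if bits & (1 << 63):
        bits ^= 0xFFFFFFFFFFFFFFFF  # negative: invert all bits
    else:
        bits |= 1 << 63  # zero or positive: set the sign bit
    return struct.pack(">Q", bits)


longs = [42, -5, 0, 2**63 - 1, -(2**63), 1, -1]
doubles = [1.0, float("-inf"), -1.5, 5e-324, 0.0, float("inf"), -5e-324]

# Sorting by the encoded bytes gives the same order as sorting the values.
longs_by_bytes = sorted(longs, key=encode_long)
doubles_by_bytes = sorted(doubles, key=encode_double)
```

Python compares `bytes` objects lexicographically as unsigned bytes, which is the same comparison a sorted byte-keyed store such as HBase applies to row keys.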
I'd really appreciate feedback about which parts are useful (or not useful), and whether this would be appropriate to submit as a contrib to HBase itself (or if people would prefer me to submit derivative work to add composite row keys to Hive/Pig/etc).

Here are the interesting features:

- All types are serialized to a byte array that sorts in the natural order of the underlying key for all key values (e.g., an Integer row key will sort correctly for negative/positive values; a double will sort correctly for any valid value: negative/positive/zero/infinity/negative infinity/subnormals/etc)
- Both ascending and descending sort order are supported for all types
- Designed for space efficiency: tricks like using the end of a byte array instead of a terminator byte, variable-length types whenever possible, etc are all employed to minimize serialization length
- Support for row key prefixes/suffixes to combine with your own custom encodings
- Variable-length integers (similar in theory to Zig-Zag encoding) are supported, and their byte serialization preserves sort ordering
- BigDecimal support (like all other types, with sort ordering-preserving byte serialization). To the best of my knowledge, this is the first byte-sortable BigDecimal serialization.
- Float/Double
- UTF-8 strings (with support for empty string, NULL, etc)
- Almost all types encode NULL, and do so without using additional space (e.g., by using a transformation on invalid UTF-8 encodings for Strings, NaNs removed during NaN canonicalization for doubles, etc). NULL compares less than any non-null value
- Support for struct (composite) row keys with an arbitrary number of fields. Each field may have its own sort order. Structs are sorted by field value.

I have the code up on github at http://github.com/mwdalton/orderly.
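As a rough sketch of how per-field sort order in a composite key can work, here is a two-field key with the first field ascending and the second descending. It assumes fixed-width fields only and does not reproduce Orderly's format or API; `encode_long`, `descending`, and `encode_key` are hypothetical names for this illustration.

```python
import struct


def encode_long(value: int) -> bytes:
    """Order-preserving 8-byte encoding of a signed 64-bit integer."""
    return struct.pack(">Q", value + (1 << 63))


def descending(encoded: bytes) -> bytes:
    """Reverse the sort order of a fixed-width encoding by inverting each byte."""
    return bytes(b ^ 0xFF for b in encoded)


def encode_key(user_id: int, timestamp: int) -> bytes:
    """Composite key: user_id ascending, then timestamp descending.

    Plain concatenation is only safe here because both fields are fixed
    width; variable-length fields need a terminator or similar scheme.
    """
    return encode_long(user_id) + descending(encode_long(timestamp))


rows = [(2, 100), (1, 100), (1, 300), (2, 50), (1, 200)]

# Sorting by the encoded key groups rows by user_id, newest timestamp first.
rows_by_key = sorted(rows, key=lambda r: encode_key(*r))
```

Because the fields are compared left to right, byte order on the concatenated key matches sorting by field value, with each field free to use its own direction.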
There are javadocs for all the row key types explaining their serialization format and performance characteristics (start with the RowKey and StructRowKey docs), as well as example code in src/example. Please let me know if you have any questions or if there's anything that would be useful to add/change. Thanks! Best regards, Mike
