Michael,

Interesting contribution to the open source community.  Sounds like nice
work.

Can you say how this relates to Avro with regard to collating of binary
data?

See, for instance, here: http://avro.apache.org/docs/current/spec.html#order

On Wed, Apr 13, 2011 at 5:55 PM, Michael Dalton <[email protected]> wrote:

> Hi all,
>
> I'm with a startup, GotoMetrics, that works with Hadoop, and I've gotten
> permission to open source Orderly -- our row key schema system for use in
> projects like HBase. Orderly allows you to serialize common data types
> (long, double, BigDecimal, etc.) or structs/records of these types to byte
> arrays, and ensures that the byte arrays sort in the same natural order as
> the data type. You may then use the byte arrays as keys in HBase (or any
> sorted, byte-typed key-value store).
>
> I'd really appreciate feedback about what parts are useful (or not useful),
> and if this would be something that would be appropriate to submit as a
> contrib to HBase itself (or if people would prefer me to submit derivative
> work to add composite row keys to Hive/Pig/etc).
>
> Here are the interesting features:
>
>   - All types are serialized to a byte array that sorts in the natural
>   order of the underlying key for all key values (e.g., an Integer row
>   key will sort correctly for negative/positive values, and a double
>   will sort correctly for negative/positive/zero/infinity/negative
>   infinity/subnormal values, etc. - any valid value)
>   - Both ascending and descending sort order are supported for all types
>   - Designed for space efficiency - tricks like using the end of a byte
>   array instead of a terminator byte, variable-length types whenever
>   possible, etc. are all employed to minimize serialization length
>   - Support for row key prefixes/suffixes to combine with your own custom
>   encodings
>   - Variable-length integers (similar in theory to Zig-Zag encoding) are
>   supported, and their byte serialization preserves sort ordering
>   - BigDecimal support (like all other types, with sort-order-preserving
>   byte serialization). To the best of my knowledge, this is the first
>   byte-sortable BigDecimal serialization.
>   - Float/Double
>   - UTF-8 strings (with support for empty string, NULL, etc)
>   - Almost all types encode NULL, and do so without using additional space
>   (e.g., by using a transformation on invalid UTF-8 encodings for Strings,
>   NaNs removed during NaN canonicalization for doubles, etc.). Null
>   compares less than any non-null value
>   - Support for struct (composite) row keys with an arbitrary number of
>   fields. Each field may have its own sort order. Structs are sorted by
>   field value.
>
> I have the code up on GitHub at http://github.com/mwdalton/orderly. There
> are javadocs for all the row key types explaining their serialization
> format and performance characteristics (start with the RowKey and
> StructRowKey docs), as well as example code in src/example.
>
> Please let me know if you have any questions or if there's anything that
> would be useful to add/change. Thanks!
>
> Best regards,
>
> Mike
>
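
The order-preserving property described in the quoted message can be
sketched for two of the listed types. This is a minimal Python illustration
of a common technique (flipping the sign bit of a big-endian long, and
flipping either the sign bit or all bits of an IEEE 754 double). It is not
taken from Orderly's source, and the names encode_long and encode_double are
hypothetical. Orderly's actual formats (variable-length integers, NULL
handling, NaN canonicalization) are described in its javadocs and are not
reproduced here.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF
SIGN_BIT = 1 << 63


def encode_long(value: int, descending: bool = False) -> bytes:
    """Encode a signed 64-bit integer so unsigned byte order matches numeric order.

    Hypothetical helper for illustration; not Orderly's wire format.
    """
    if not -(1 << 63) <= value < (1 << 63):
        raise ValueError("value out of signed 64-bit range")
    # Two's-complement bits with the sign bit flipped: negatives now sort
    # below non-negatives when the bytes are compared as unsigned values.
    bits = (value & MASK64) ^ SIGN_BIT
    if descending:
        # Inverting every bit reverses the byte-wise sort order.
        bits ^= MASK64
    return bits.to_bytes(8, "big")


def encode_double(value: float, descending: bool = False) -> bytes:
    """Encode an IEEE 754 double so unsigned byte order matches numeric order.

    Hypothetical helper for illustration; NaN is not treated specially here.
    """
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    if bits & SIGN_BIT:
        # Negative values: flip all bits so larger magnitudes sort first.
        bits ^= MASK64
    else:
        # Non-negative values: flip only the sign bit to sort above negatives.
        bits ^= SIGN_BIT
    if descending:
        bits ^= MASK64
    return bits.to_bytes(8, "big")


if __name__ == "__main__":
    longs = [7, -3, 0, -(1 << 63), (1 << 63) - 1]
    # Sorting by encoded bytes yields the same order as sorting numerically.
    print(sorted(longs, key=encode_long) == sorted(longs))
    doubles = [1.5, float("-inf"), -5e-324, 0.0, float("inf"), -2.0]
    print(sorted(doubles, key=encode_double) == sorted(doubles))
```

Python compares bytes objects lexicographically as unsigned bytes, which is
the same comparison a sorted, byte-typed key-value store applies to row keys,
so sorting by the encoded form stands in for the store's ordering in this
sketch.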
