Re: just open sourced Orderly -- a row key schema system (composite keys, etc) for use with HBase

Michael Dalton Wed, 13 Apr 2011 23:26:07 -0700

Hi Ted,

Thanks for pointing that out, I hadn't read the Avro sorted specs recently.
It looks like there's some overlap at a high level (providing byte array
representations that can be sorted without deserialization). After glancing
at BinaryEncoder/Decoder in Avro, it looks to me like the differences are:


   - Avro avoids deserialization when sorting its data, but it uses
   custom byte array comparators for different types. All of our encodings,
   including struct/record types, actually sort if you just compare the raw
   bytes using Bytes.compareTo. You can directly use the serialized byte values
   from this project in HBase without requiring HBase to implement its own
   custom comparator functions (it appears that for Avro key support, you'd
   need to parse Avro's schemas and have a custom comparator defined for each
   data type, and this would be used in HBase's sorting functions). You can
   drop Orderly's row keys into HBase without modifying the code base at all.
   - Per the above point, the actual serialization algorithms we use are
   quite different as we can't rely on custom comparator functions -- just
   Bytes.compareTo comparing raw bytes. The serializations end up with similar
   goals (e.g., variable-length zig-zag integers) but the implementation and
   algorithms are very different.
   - Very slightly more compact encodings in certain situations for some
   types -- our Strings don't require an integer length, they use a terminator
   byte, and in ascending sort don't even require the terminator byte. Our
   variable-length integers have some very minor length differences (by a bit
   or two) in some larger variable length long serializations. Probably not
   enough to really matter in all honesty.
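
To make the first point concrete, here is a minimal sketch of a byte encoding
that sorts correctly under a plain unsigned byte-by-byte comparison. It is not
Orderly's actual format (Orderly's integers are variable-length); it assumes a
fixed-width 8-byte long with the sign bit flipped. The class name
SortableLongSketch and the helper compareRaw are hypothetical; compareRaw
stands in for Bytes.compareTo so the example runs without HBase on the
classpath.

```java
import java.nio.ByteBuffer;

public class SortableLongSketch {
    // Big-endian bytes with the sign bit flipped, so that negative values
    // get a smaller leading byte than non-negative values.
    public static byte[] encode(long v) {
        return ByteBuffer.allocate(8).putLong(v ^ Long.MIN_VALUE).array();
    }

    // Reverse the transformation: read the long and flip the sign bit back.
    public static long decode(byte[] b) {
        return ByteBuffer.wrap(b).getLong() ^ Long.MIN_VALUE;
    }

    // Lexicographic comparison treating each byte as unsigned (a stand-in
    // for a raw byte comparison such as Bytes.compareTo).
    public static int compareRaw(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        long[] values = {Long.MIN_VALUE, -42L, -1L, 0L, 1L, 42L, Long.MAX_VALUE};
        // Each encoded value must compare strictly less than its successor.
        for (int i = 0; i + 1 < values.length; i++) {
            if (compareRaw(encode(values[i]), encode(values[i + 1])) >= 0) {
                throw new AssertionError("order broken at index " + i);
            }
        }
        System.out.println("raw byte order matches numeric order");
    }
}
```

Without the sign-bit flip, the two's-complement bytes of -1 (all 0xff) would
sort after every positive value, which is why a raw comparator cannot be used
on unmodified big-endian longs.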

Avro is very cool, and for a general serialization and RPC platform it's
definitely fantastic. Orderly is more of a focused solution on producing
byte arrays for use in projects like HBase, without requiring those projects
to integrate a serialization system. If you have any more questions or have
some features I missed that I should be contrasting, let me know.

Best regards,

Mike

On Wed, Apr 13, 2011 at 8:07 PM, Ted Dunning <[email protected]> wrote:

> Michael,
>
> Interesting contribution to the open source community.  Sounds like nice
> work.
>
> Can you say how this relates to Avro with regard to collating of binary
> data?
>
> See, for instance, here:
> http://avro.apache.org/docs/current/spec.html#order
>
>
> On Wed, Apr 13, 2011 at 5:55 PM, Michael Dalton <[email protected]> wrote:
>
>> Hi all,
>>
>> I'm with a startup, GotoMetrics, doing things with Hadoop, and I've gotten
>> permission to open source Orderly -- our row key schema system for use in
>> projects like HBase. Orderly allows you to serialize common data types
>> (long, double, BigDecimal, etc) or structs/records of these types to byte
>> arrays, and ensures that the byte arrays sort in the same natural order as
>> the data type. You may then use the byte arrays as keys in HBase (or any
>> sorted, byte-typed key-value store).
>>
>>  I'd really appreciate feedback about which parts are useful (or not
>> useful),
>> and if this would be something that would be appropriate to submit as a
>> contrib to HBase itself (or if people would prefer me to submit derivative
>> work to add composite row keys to Hive/Pig/etc).
>>
>> Here are the interesting features:
>>
>>   - All types are serialized to a byte array that sorts in the natural order
>>   of the underlying key for all key values (e.g., an Integer row key will
>> sort
>>   correctly for negative/positive values, a double will sort correctly for
>>   negative/positive/zero/infinity/negative infinity/subnormals/etc - any
>> valid
>>   value)
>>   - Both ascending and descending sort order are supported for all types
>>   - Designed for space efficiency - tricks like using the end of a byte
>>   array instead of a terminator byte, variable-length types whenever
>> possible,
>>   etc are all employed to minimize serialization length
>>   - Support for row key prefixes/suffixes to combine with your own custom
>>   encodings
>>   - Variable-length integers (similar in theory to Zig-Zag encoding) are
>>   supported, and their byte serialization preserves sort ordering
>>   - BigDecimal support (like all other types, with sort
>> ordering-preserving
>>   byte serialization). To the best of my knowledge the first byte-sortable
>>   BigDecimal serialization.
>>   - Float/Double
>>   - UTF-8 strings (with support for empty string, NULL, etc)
>>   - Almost all types encode NULL, and do so without using additional space
>>   (e.g., by using transformation on invalid UTF-8 encodings for Strings,
>> NaNs
>>   removed during NaN canonicalization for doubles, etc). Null compares
>> less
>>   than any non-null value
>>   - Support for struct (composite) row keys with an arbitrary number of
>>   fields. Each field may have its own sort order. Structs are sorted by
>> field
>>   value.
>>
>> I have the code up on github at  http://github.com/mwdalton/orderly.
>> There
>> are javadocs for all the row key types explaining their serialization
>> format
>> and performance characteristics (start with the RowKey and StructRowKey
>> docs), as well as example code in src/example.
>>
>> Please let me know if you have any questions or if there's anything that
>> would be useful to add/change. Thanks!
>>
>> Best regards,
>>
>> Mike
>>
>
>
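
The double ordering and the ascending/descending option from the quoted
feature list can be illustrated with a well-known bit trick. This is only a
sketch under stated assumptions, not Orderly's documented format: it uses a
fixed 8-byte encoding, does not reserve a NULL value, and relies on
Double.doubleToLongBits collapsing all NaNs to one canonical bit pattern. The
names SortableDoubleSketch and compareRaw are hypothetical.

```java
import java.nio.ByteBuffer;

public class SortableDoubleSketch {
    // Negative doubles: invert every bit (larger magnitude sorts first).
    // Non-negative doubles: flip only the sign bit (they sort after negatives).
    // Descending order: invert every bit of the ascending encoding.
    public static byte[] encode(double d, boolean descending) {
        long bits = Double.doubleToLongBits(d);
        // (bits >> 63) is all ones for negative inputs and zero otherwise.
        bits ^= (bits >> 63) | Long.MIN_VALUE;
        if (descending) {
            bits = ~bits;
        }
        return ByteBuffer.allocate(8).putLong(bits).array();
    }

    // Lexicographic comparison treating each byte as unsigned (a stand-in
    // for a raw byte comparison such as Bytes.compareTo).
    public static int compareRaw(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        double[] values = {
            Double.NEGATIVE_INFINITY, -1.5, -Double.MIN_VALUE,
            0.0, Double.MIN_VALUE, 1.5, Double.POSITIVE_INFINITY
        };
        for (int i = 0; i + 1 < values.length; i++) {
            int asc = compareRaw(encode(values[i], false), encode(values[i + 1], false));
            int desc = compareRaw(encode(values[i], true), encode(values[i + 1], true));
            if (asc >= 0 || desc <= 0) {
                throw new AssertionError("order broken at index " + i);
            }
        }
        System.out.println("ascending and descending encodings sort as expected");
    }
}
```

In this sketch -0.0 sorts just before 0.0 and the canonical NaN sorts after
positive infinity in ascending order; a real schema would have to decide how
to treat both cases.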
