Hi Adam,

It was a logical partition number calculated by a split calculator (based on 
the number of servers) – so basically yes. I've just realized that this 
approach needs to be rethought in the new key design.

Thank you
Roman

From: Adam Fuchs [mailto:afu...@apache.org]
Sent: 14 September 2015 23:46
To: user@accumulo.apache.org
Subject: Re: RowID design and Hive push down

Hi Roman,

What's the <payload> used for in your previous key design?

As I'm sure you've figured out, it's generally a bad idea to have a fully 
unique hash in your key, especially if you're trying to support extensive 
secondary indexing. What we've found is that it's not just the size of the key 
but also the compressibility that matters. It's often better to use a one-up 
counter of some sort, regardless of whether you're using a hex encoding or a 
binary encoding. Due to the birthday problem [1], a one-up id generally takes 
less than half the bytes of a uniformly distributed hash with a low collision 
probability – roughly, avoiding collisions among n random ids needs about 
2*log2(n) bits, while simply counting to n needs only log2(n) – and it will 
compress much better. Twitter did something like that in a distributed fashion 
with a scheme they called Snowflake [2].
Google also published about high performance timestamp oracles for transactions 
in their Percolator paper [3].
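
For concreteness, here is a minimal sketch of a Snowflake-style generator. The 
bit widths and epoch follow Twitter's published layout, but the class and 
method names are my own illustration, not Twitter's code:

    // Sketch of a Snowflake-style id: 41-bit millisecond timestamp,
    // 10-bit worker id, 12-bit per-millisecond sequence.
    public class SnowflakeIdGenerator {
        private static final long EPOCH = 1288834974657L; // Twitter's custom epoch
        private final long workerId;                      // 0..1023
        private long lastTimestamp = -1L;
        private long sequence = 0L;

        public SnowflakeIdGenerator(long workerId) {
            this.workerId = workerId & 0x3FF;
        }

        public synchronized long nextId() {
            long now = System.currentTimeMillis();
            if (now == lastTimestamp) {
                sequence = (sequence + 1) & 0xFFF;  // 12-bit sequence
                if (sequence == 0) {
                    // Sequence exhausted for this millisecond; spin to the next one.
                    while (now <= lastTimestamp) {
                        now = System.currentTimeMillis();
                    }
                }
            } else {
                sequence = 0;
            }
            lastTimestamp = now;
            return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
        }
    }

Ids generated this way are time-ordered and dense, so they compress far better 
than uniformly distributed hashes.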

Cheers,
Adam

[1] https://en.wikipedia.org/wiki/Birthday_problem
[2] https://github.com/twitter/snowflake
[3] http://research.google.com/pubs/pub36726.html


On Mon, Sep 14, 2015 at 2:47 PM, roman.drap...@baesystems.com wrote:
Hi there,

Our current rowid format is yyyyMMdd_payload_sha256(raw data). It works nicely, 
as we get a date prefix plus uniqueness guaranteed by the hash; unfortunately, 
however, each rowid is around 50-60 bytes.

Requirements are the following:

1) Support Hive on top of Accumulo for ad-hoc queries

2) Query the original table by date range (e.g. rowID >= '20060101' AND 
rowID < '20060103'), both in code and in Hive (see the sketch after this list)

3) Additional queries by ~20 different fields
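
In code, the date-range part of requirement 2) is a plain Accumulo scan over 
the yyyyMMdd prefix. A hedged sketch against the 1.x client API – the table 
name "events" and the Connector are assumptions for illustration:

    import java.util.Map;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class DateRangeScan {
        // Scan rows whose yyyyMMdd-prefixed rowid falls in [20060101, 20060103).
        static void scanByDate(Connector connector) throws TableNotFoundException {
            Scanner scanner = connector.createScanner("events", Authorizations.EMPTY);
            scanner.setRange(new Range("20060101", true, "20060103", false));
            for (Map.Entry<Key, Value> entry : scanner) {
                System.out.println(entry.getKey().getRow());
            }
            scanner.close();
        }
    }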

Requirement 3) calls for secondary indexes, and because each RowID is 50-60 
bytes, those indexes become huge (99% of the overall space) and really 
expensive to store.

What we are looking to do is reduce the index rowid to a fixed size: 
{unixTime}{logicalSplit}{hash}, where unixTime is a 4-byte unsigned integer, 
logicalSplit is a 2-byte unsigned integer, and hash is 4 bytes – 10 bytes overall.
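
A minimal packing sketch, assuming big-endian byte order so that lexicographic 
rowid order matches numeric time order (the class and method names are 
hypothetical):

    import java.nio.ByteBuffer;

    // Pack the proposed 10-byte rowid: 4-byte unsigned unixTime,
    // 2-byte unsigned logicalSplit, 4-byte hash.
    public final class RowIdCodec {
        public static byte[] pack(long unixTimeSeconds, int logicalSplit, int hash) {
            ByteBuffer buf = ByteBuffer.allocate(10);   // big-endian by default
            buf.putInt((int) unixTimeSeconds);          // low 32 bits, as unsigned
            buf.putShort((short) logicalSplit);         // low 16 bits, as unsigned
            buf.putInt(hash);
            return buf.array();
        }
    }

Big-endian order matters here because Accumulo sorts rowids lexicographically 
as unsigned bytes, so keys packed this way stay in time order.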

What is unclear to me is how the second requirement can be met in Hive, as to 
my understanding the built-in RowID push-down mechanism won't work with 
unsigned bytes?
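
To make the unsigned-byte concern concrete, here is a quick check reusing the 
hypothetical RowIdCodec above. Unsigned lexicographic comparison (Accumulo's 
rowid order) preserves time order even where the top bit of unixTime flips, 
whereas signed byte comparison would invert it:

    // Compare two rowids byte-by-byte as unsigned values, the way Accumulo does.
    public class UnsignedOrderCheck {
        static int compareUnsigned(byte[] a, byte[] b) {
            for (int i = 0; i < Math.min(a.length, b.length); i++) {
                int d = (a[i] & 0xFF) - (b[i] & 0xFF);
                if (d != 0) return d;
            }
            return a.length - b.length;
        }

        public static void main(String[] args) {
            byte[] t1 = RowIdCodec.pack(0x7FFFFFFFL, 0, 0); // 2038-01-19 03:14:07 UTC
            byte[] t2 = RowIdCodec.pack(0x80000000L, 0, 0); // one second later
            System.out.println(compareUnsigned(t1, t2) < 0); // true: time order kept
        }
    }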

Regards,
Roman




Please consider the environment before printing this email. This message should 
be regarded as confidential. If you have received this email in error please 
notify the sender and destroy it immediately. Statements of intent shall only 
become binding when confirmed in hard copy by an authorised signatory. The 
contents of this email may relate to dealings with other companies under the 
control of BAE Systems Applied Intelligence Limited, details of which can be 
found at http://www.baesystems.com/Businesses/index.htm.

