Yup, your decision makes sense.
I was just pointing out that, while Lexicoders are a great way to not
have to think about serialization, that recommendation may not be the
best for the AccumuloStorageHandler (Hive) case.
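For readers following along, this is roughly what the Lexicoder approach looks like outside of Hive; a minimal sketch using the stock classes in org.apache.accumulo.core.client.lexicoder (the timestamp and split values are just placeholders):

import org.apache.accumulo.core.client.lexicoder.LongLexicoder;
import org.apache.accumulo.core.client.lexicoder.UIntegerLexicoder;

public class LexicoderSketch {
    public static void main(String[] args) {
        // LongLexicoder encodes longs so that the byte[] sort order matches
        // numeric order, which is what you want for a time-prefixed rowID.
        LongLexicoder longLex = new LongLexicoder();
        byte[] ts = longLex.encode(1442270400L); // epoch seconds

        // UIntegerLexicoder gives a compact, order-preserving encoding
        // for non-negative ints, e.g. a logical split id.
        UIntegerLexicoder intLex = new UIntegerLexicoder();
        byte[] split = intLex.encode(42);

        // Decoding is symmetric, so no hand-rolled serialization is needed.
        long decoded = longLex.decode(ts);
        System.out.println(decoded); // 1442270400
    }
}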
roman.drap...@baesystems.com wrote:
We really want to avoid writing custom code and rely on a "standard" approach only. We made this mistake in the past: we wrote our own AccumuloStorageManager a year before it officially appeared and used our own custom encoding to type the values, and as a result, to support Pig/Spark we need to write custom storage managers. We decided to migrate to a standard approach, and during this process we are thinking hard about the most efficient way to reduce the size of the indices (an equivalent solution in HDFS/Hive + Elasticsearch requires 3x less space).
-----Original Message-----
From: Josh Elser [mailto:josh.el...@gmail.com]
Sent: 14 September 2015 23:50
To: Drapeko, Roman (UK Guildford)
Cc: user@accumulo.apache.org
Subject: Re: RowID design and Hive push down
I don't think we have a writeup in the user manual (which we really should), but Adam recently wrote a great explanation of how RFile removes duplicate data between sequential keys in a file:
http://mail-archives.apache.org/mod_mbox/accumulo-dev/201509.mbox/%3CCAPMpPc5dP14e0w1%3DU47qd-wyrqhJ0wu6JmrwvpH5OE-e9eCNJQ%40mail.gmail.com%3E
If your query mechanism is via Hive, you'd have to write some custom code to
use the Lexicoders (an extension to the AccumuloRowSerializer).
roman.drap...@baesystems.com wrote:
Yes, payload + sha256 adds 35 more bytes, so we want to use 4 bytes instead of 32 for the hash, but we need second precision (instead of day).
Where can I read about the yyyyMMdd prefix removal in Accumulo? I don't really understand how it is supposed to work.
And I have the book mentioned below, but the index does not give me any reference to Lexicoders.
-----Original Message-----
From: Josh Elser [mailto:josh.el...@gmail.com]
Sent: 14 September 2015 23:32
To: Drapeko, Roman (UK Guildford)
Cc: user@accumulo.apache.org
Subject: Re: RowID design and Hive push down
Encoding the year as hex makes your SQL queries a bit uglier if you have a user sitting at the endpoint, but it should work on the same principle as the ASCII string did.
I am a bit confused why yyyyMMdd stored as an ASCII string was (noticeably?) causing you problems WRT size. I thought that Accumulo >= 1.5 would be eliminating that repetitious prefix. Are you sure it wasn't the "payload_sha256" you had as a suffix that was problematic?
Human-readable data (that doesn't sacrifice performance terribly) is always more pleasant to work with. Just a thought.
roman.drap...@baesystems.com wrote:
So the simplest solution looks to be representing the Unix epoch time as a hexadecimal string (+4 bytes) and doing the same.
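A minimal sketch of that idea; zero-padding to a fixed width is the important part, otherwise the lexicographic order stops matching the numeric order:

import java.time.Instant;

public class HexRowPrefix {
    public static void main(String[] args) {
        long epochSeconds = Instant.parse("2006-01-01T00:00:00Z").getEpochSecond();
        // 8 hex characters cover the full unsigned 32-bit range, so the ASCII
        // string sorts the same way the underlying epoch value does.
        String prefix = String.format("%08x", epochSeconds);
        System.out.println(prefix); // 43b71b80
    }
}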
-----Original Message-----
From: Josh Elser [mailto:josh.el...@gmail.com]
Sent: 14 September 2015 22:37
To: Drapeko, Roman (UK Guildford)
Cc: user@accumulo.apache.org
Subject: Re: RowID design and Hive push down
Yes. The reason the simple approach below would work is that, before, you'd just operate on the day boundary (as specified by the yyyyMMdd) and the suffix would naturally fall into the prefix range.
Some code might help draw it together. The comments should bridge the
gap
https://github.com/apache/hive/blob/release-1.2.1/accumulo-handler/src/java/org/apache/hadoop/hive/accumulo/predicate/AccumuloRangeGenerator.java#L277
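For reference, the end result is just an ordinary Accumulo Range over the string row IDs; a minimal sketch with the plain client API (not the Hive handler itself), assuming the day-boundary semantics described above:

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.hadoop.io.Text;

public class DayBoundaryRange {
    public static void main(String[] args) {
        // One way to express the 20060101 day range from earlier in the thread:
        // include the start of that day, exclude the start of the next one.
        Range day = new Range("20060101", true, "20060102", false);

        // Any rowID of the form 20060101_<payload>_<sha256> sorts inside that range.
        System.out.println(day.contains(new Key(new Text("20060101_blabla")))); // true
    }
}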
roman.drap...@baesystems.com wrote:
Hi Josh,
Thanks for response.
Well, I am not an expert in Accumulo (so I'm looking for a clue about how to implement this so that we avoid custom code as much as possible) - I will try to expand my answer a little and explain what I don't understand.
For example, if my rowID looks like this: 20060101_blabla
I can query Hive with something like: select * from tbl where rowid > '20060101' and rowid < '20060102'. To my understanding, what's happening under the hood is that AccumuloPredicateHandler creates a Range('20060101', '20060102') that is used for scanning (?)
Am I correct in saying that AccumuloPredicateHandler always creates a range that works with strings only, and that it's not possible to amend this logic?
Regarding Java primitives - they can always be represented as byte[4]
Roman
-----Original Message-----
From: Josh Elser [mailto:josh.el...@gmail.com]
Sent: 14 September 2015 21:10
To: user@accumulo.apache.org
Cc: Drapeko, Roman (UK Guildford)
Subject: Re: RowID design and Hive push down
I'm not positive what you mean by "the in-built RowID push down mechanism won't work with unsigned bytes". Are you saying that you're trying to change your current rowID structure to a unixTime+logicalSplit+hash structure, and trying to evaluate the 3 listed requirements against the new form?
First off, the Java primitives are signed, so you're going to be limited by that. Don't forget that.
Have you seen accumulo.composite.rowid from https://cwiki.apache.org/confluence/display/Hive/AccumuloIntegration? Hypothetically, you can provide some logic which will do custom parsing on your row and generate a struct from the components in your row ID.
Of interest might be:
https://github.com/apache/hive/blob/release-1.2.1/accumulo-handler/src/java/org/apache/hadoop/hive/accumulo/serde/AccumuloRowSerializer.java
https://github.com/apache/hive/blob/release-1.2.1/accumulo-handler/src/test/org/apache/hadoop/hive/accumulo/serde/TestAccumuloRowSerializer.java
You could extend the AccumuloRowSerializer to parse the bytes of the
rowId according to your own spec. I haven't explicitly tried this
myself, but in theory, I think your problems are meant to be solved
by this support. It will take a little bit of effort. Hive's
LazyObject type system is not my favorite framework to work with.
Referencing some of the HBaseStorageHandler code might also be
worthwhile (as the two are very similar).
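To make that concrete, here is a minimal sketch of just the parsing half for the 10-byte {unixTime}{logicalSplit}{hash} layout proposed below; plain Java, nothing Hive-specific, and the class and field names are only illustrative. Whether this ends up inside an AccumuloRowSerializer extension or behind accumulo.composite.rowid, the byte-level logic is the same.

import java.nio.ByteBuffer;

/** Splits a 10-byte rowId into its components: 4-byte time, 2-byte split, 4-byte hash. */
public class RowIdParts {
    public final long unixTime;    // epoch seconds, read back as unsigned
    public final int logicalSplit; // read back as unsigned short
    public final int hash;         // truncated digest, kept as raw int bits

    public RowIdParts(byte[] rowId) {
        ByteBuffer buf = ByteBuffer.wrap(rowId); // big-endian by default
        this.unixTime = buf.getInt() & 0xFFFFFFFFL;
        this.logicalSplit = buf.getShort() & 0xFFFF;
        this.hash = buf.getInt();
    }
}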
- Josh
roman.drap...@baesystems.com wrote:
Hi there,
Our current rowid format is yyyyMMdd_payload_sha256(raw data). It works nicely as we have a date and uniqueness guaranteed by the hash; however, unfortunately, the rowid is around 50-60 bytes per record.
Requirements are the following:
1) Support Hive on top of Accumulo for ad-hoc queries
2) Query the original table by date range (e.g. rowID < '20060101' AND rowID >= '20060103') both in code and in Hive
3) Additional queries by ~20 different fields
Requirement 3) requires secondary indexes, and of course, because each rowID is 50-60 bytes, they become super massive (99% of overall space) and really expensive to store.
What we are looking to do is reduce the index to a fixed size: {unixTime}{logicalSplit}{hash}, where unixTime is a 4-byte unsigned integer, logicalSplit is a 2-byte unsigned integer, and hash is 4 bytes - overall 10 bytes.
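For reference, a minimal sketch of how such a 10-byte rowID could be assembled, assuming a big-endian layout, epoch seconds truncated to 4 bytes, and the first 4 bytes of the SHA-256 digest as the hash (class and method names are only illustrative):

import java.nio.ByteBuffer;
import java.security.MessageDigest;

public class CompactRowId {
    /** Builds {unixTime:4}{logicalSplit:2}{hash:4} = 10 bytes. */
    public static byte[] build(long epochSeconds, int logicalSplit, byte[] rawData) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(rawData);
        return ByteBuffer.allocate(10)
            // 4 bytes, big-endian, so rows sort chronologically under
            // Accumulo's unsigned byte comparison of row IDs
            .putInt((int) epochSeconds)
            .putShort((short) logicalSplit)   // 2 bytes
            .put(digest, 0, 4)                // first 4 bytes of the 32-byte digest
            .array();
    }
}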
What is unclear to me is how the second requirement can be met in Hive, as to my understanding the in-built RowID push-down mechanism won't work with unsigned bytes?
Regards,
Roman