Excellent information – thanks

__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory



From: Eric Newton <[email protected]>
Reply-To: [email protected]
To: [email protected]
Subject: Re: Table design

Some thoughts:

Accumulo will accommodate keys that are very large (like 100K), but I don't 
recommend it: large keys make the indexes big and slow down just about every 
operation.  A row-id or column qualifier that is 200 bytes long is not extreme.  
Remember that compression will decrease the storage requirements, especially 
since the sort creates natural redundancy in the row id.

Is it important to find "Three men and a baby" just after "Three little pigs"?  
If not, hash the title and look up the hash.  That will give you a nice small 
key.  This also avoids hot-spots, like all the titles that start with "The" or 
a common letter, like "S". But you may need to deal with hash collisions.
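
A minimal sketch of the hashing idea, not Accumulo client code. It assumes SHA-256 truncated to 16 hex characters as the row id and uses a plain dict as a stand-in for the table; the names title_row_id, put_title and get_title are hypothetical. Keeping the full title as the column qualifier is one possible way to tell colliding titles apart.

```python
import hashlib

def title_row_id(title, hex_chars=16):
    """Derive a short, fixed-length row id from a title of any length."""
    digest = hashlib.sha256(title.encode("utf-8")).hexdigest()
    return digest[:hex_chars]

# Stand-in for a table: row id -> {column qualifier: value}.
table = {}

def put_title(title, value):
    # The full title is kept as the column qualifier, so two titles that
    # happen to share a hash still end up in separate entries.
    table.setdefault(title_row_id(title), {})[title] = value

def get_title(title):
    # Look up by hash, then confirm the exact title inside the row.
    return table.get(title_row_id(title), {}).get(title)

put_title("Three little pigs", "record-1")
put_title("Three men and a baby", "record-2")

print(title_row_id("Three little pigs"))
print(get_title("Three men and a baby"))
```

The row id stays the same length no matter how long the title is, and adjacent titles no longer sort next to each other, which is the trade-off described above.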

Counters can give you "append" hot-spots.  As you ingest, the most active 
tablet will always be the newest one.
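
To illustrate the append hot-spot, here is a small simulation rather than real Accumulo code. It assumes each tablet covers the range (previous split, split], zero-padded counters as row ids, and made-up split points; tablet_for is a hypothetical helper.

```python
import bisect
import hashlib

def tablet_for(row_id, splits):
    """Index of the tablet whose range (prev split, split] holds row_id."""
    return bisect.bisect_left(splits, row_id)

# Assumed split points left behind by earlier ingest of counters 0..999.
counter_splits = ["0000000250", "0000000500", "0000000750"]
new_counters = ["%010d" % n for n in range(1000, 1100)]
counter_tablets = {tablet_for(r, counter_splits) for r in new_counters}

# The same 100 ids, hashed, against assumed splits over the hex key space.
hash_splits = ["4", "8", "c"]
hashed = [hashlib.sha256(r.encode("utf-8")).hexdigest() for r in new_counters]
hash_tablets = {tablet_for(h, hash_splits) for h in hashed}

print(sorted(counter_tablets))  # every new counter lands in the last tablet
print(sorted(hash_tablets))     # hashed ids land in more than one tablet
```

Every new counter sorts after the last split, so a single tablet takes all of the writes, while the hashed ids are spread over the key space.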

A random UUID is useful, but large, if you just want a unique identifier 
associated with a title.
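
For a sense of the size involved, a quick check using Python's standard uuid module (the variable name title_id is just for illustration):

```python
import uuid

# A random (version 4) UUID to associate with a title.
title_id = uuid.uuid4()

print(len(str(title_id)))   # 36 characters in the canonical text form
print(len(title_id.hex))    # 32 characters as hex without dashes
print(len(title_id.bytes))  # 16 bytes in raw binary form
```

Storing the raw 16 bytes rather than the 36-character string would roughly halve the key size, assuming nothing needs the key to be human-readable.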

Accumulo performance should not change if you have 1 table or 100.  But tables 
are a convenient unit for management.  You can offline, compact and delete a 
table.  You can configure many table-specific properties which can give you 
performance benefits.

-Eric

On Wed, Jun 6, 2012 at 4:46 PM, Perko, Ralph J <[email protected]> wrote:
Hi,  I am in the process of designing some Accumulo tables for an app and have 
some questions:

Lookup Table:
The data's natural qualifier is a title.  This title can be any length.  Some 
are as long as 200 characters.
I am using this title as a row id and also as a column qualifier in other 
places.
Is it considered good practice to have a lookup table for these titles (like 
RDBMS), replacing them with some incremented integer value, or should I just 
continue to use these long titles as row ids?

Multiple Tables:
What are the best practices around when to create a new table?  I have been 
breaking up my tables based on row id semantics.  For example, title row ids 
are in a different table than row ids based on some analysis count.
Does breaking up data into multiple tables help, hurt, or do nothing for 
Accumulo performance?

Thanks,
Ralph
__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory


