Excellent information – thanks
__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory

From: Eric Newton <[email protected]>
Reply-To: "[email protected]" <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: Table design

Some thoughts:

Accumulo will accommodate keys that are very large (like 100K) but I don't recommend it. It makes indexes big and slows down just about every operation. A row-id or column qualifier that is 200 bytes long is not extreme. Remember that compression will decrease the storage requirements, especially since the sort creates natural redundancy in the row id.

Is it important to find "Three men and a baby" just after "Three little pigs"? If not, hash the title and look up the hash. That will give you a nice small key. This also avoids hot-spots, like all the titles that start with "The" or a common letter, like "S". But you may need to deal with hash collisions.

Counters can give you "append" hot-spots. As you ingest, the most active tablet will always be the newest one. A random UUID is useful, but large, if you just want a unique identifier associated with a title.

Accumulo performance should not change if you have 1 table or 100. But tables are a convenient unit for management. You can offline, compact and delete a table. You can configure many table-specific properties which can give you performance benefits.

-Eric

On Wed, Jun 6, 2012 at 4:46 PM, Perko, Ralph J <[email protected]> wrote:

Hi,

I am in the process of designing some Accumulo tables for an app and have some questions:

Lookup Table:
The data's natural qualifier is a title. This title can be any length. Some are as long as 200 characters. I am using this title as a row id and also as a column qualifier in other places.
Is it considered good practice to have a lookup table for these titles (as in an RDBMS), replacing them with some incremented integer value, or should I just continue to use these long titles as row ids?

Multiple Tables:
What are the best practices around when to create a new table? I have been breaking up my tables based on row id semantics. For example, title row ids are in a different table than row ids based on some analysis count. Does breaking up data into multiple tables help, hurt, or do nothing for Accumulo performance?

Thanks,
Ralph
__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory
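
For anyone reading this thread later, the "hash the title and look up the hash" suggestion could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's standard hashlib rather than any Accumulo API, the function names are hypothetical, the in-memory dict merely stands in for a table, and the choice of SHA-256 truncated to 16 hex characters is arbitrary. Storing the original title next to the hash is one possible way to deal with collisions.

```python
import hashlib


def title_row_id(title, hex_chars=16):
    """Derive a short, fixed-length row id from an arbitrarily long title."""
    digest = hashlib.sha256(title.encode("utf-8")).hexdigest()
    # Truncating keeps the key small but makes collisions more likely.
    return digest[:hex_chars]


# Hypothetical stand-in for a table: row id -> titles stored under that row.
table = {}


def put_title(title):
    """Store a title under its hashed row id and return that row id."""
    row = title_row_id(title)
    titles = table.setdefault(row, [])
    if title not in titles:
        titles.append(title)
    return row


def has_title(title):
    """Look up by hash, then compare the stored title to rule out a collision."""
    return title in table.get(title_row_id(title), [])
```

The trade-off is the one described in the reply: hashed row ids are small and spread evenly across the key space, but titles that sort next to each other alphabetically no longer end up adjacent.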
