Thank you to all responders. This clears it up greatly.

Dave P

On Wed, May 27, 2015 at 10:52 AM, Christopher <[email protected]> wrote:

> David-
>
> Both the column family (CF) and column qualifier (CQ) could be thought of
> as arbitrary dimensions in the key. If you only need one dimension to
> specify your data, the other can be empty. You could also store these in
> separate tables, as you suggest, but part of the power of Accumulo is that
> you don't actually need to separate your data this way. You can keep it in
> the same table, organized by CF and, as Andrew alludes to, you can store
> specific CFs in a particular locality group for faster access when querying
> just data in those CFs.
>
> If you have only one category of data, I'd recommend storing the specifier
> in the CQ, not the CF. Although they look like they would be equivalent for
> this case, you're more resilient to future changes if you use the CQ,
> because now you can reserve the CF for later changes to the schema you're
> using or for another kind of data you want to mix in later.
>
> In addition, I would recommend using the CF to store data from a finite
> set (such as "one of {STRING, DATE, INT}" or "one of {CategoryA,
> CategoryB}" or "one of {Schema1, Schema2}", etc.), while you can use the CQ
> to store arbitrary data (such as "<date>", "<number>", "<name>", etc.). The
> reason for this is that locality groups, should you ever decide to use them
> at some point in the future, can only (currently) be specified as a finite
> discrete set of CFs, and not a pattern or other predicate. So, not storing
> arbitrary data in the CFs will leave that option available to you.
>
> Basically, you are right that you can use either or both for your column
> names, but there are a few good practices which might help you decide which
> is better to use for your data.
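The advice above (a finite set of values in the CF, arbitrary names in the CQ) can be illustrated with a toy model. This is only a sketch in plain Java and deliberately does not use the Accumulo API: the class `ColumnSketch`, its `Key` type, the sample rows, and the helper `valuesInFamilies` are all hypothetical names invented for illustration. It assumes entries sort by row, then CF, then CQ, and it treats a "locality group" as nothing more than an explicit set of CF names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model only: NOT the Accumulo API. All names here are hypothetical.
final class ColumnSketch {

    // Simplified key with three parts: row, column family (CF), column qualifier (CQ).
    static final class Key {
        final String row;
        final String cf;
        final String cq;

        Key(String row, String cf, String cq) {
            this.row = row;
            this.cf = cf;
            this.cq = cq;
        }
    }

    // Assumed ordering for this sketch: by row, then CF, then CQ.
    static final Comparator<Key> ORDER = Comparator
            .comparing((Key k) -> k.row)
            .thenComparing(k -> k.cf)
            .thenComparing(k -> k.cq);

    // Build a small table where the CF comes from a finite set
    // ({STRING, DATE, INT}) and the CQ holds the open-ended column name.
    static TreeMap<Key, String> sampleTable() {
        TreeMap<Key, String> table = new TreeMap<>(ORDER);
        table.put(new Key("row1", "STRING", "name"), "Alice");
        table.put(new Key("row1", "DATE", "created"), "2015-05-27");
        table.put(new Key("row1", "INT", "age"), "42");
        table.put(new Key("row2", "STRING", "name"), "Bob");
        return table;
    }

    // Return, in key order, the values whose CF is in the given finite set.
    // An enumerable set of CFs is the kind of thing a locality group can name;
    // a pattern over arbitrary CF values would not fit this shape.
    static List<String> valuesInFamilies(TreeMap<Key, String> table, Set<String> families) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Key, String> e : table.entrySet()) {
            if (families.contains(e.getKey().cf)) {
                out.add(e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(valuesInFamilies(sampleTable(), Set.of("STRING")));
    }
}
```

Because the CFs here form a small known set, selecting "everything of type STRING" is a simple membership check. If the arbitrary names had been stored in the CF instead, there would be no fixed list of families to enumerate.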
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
> On Wed, May 27, 2015 at 9:17 AM, Andrew Wells <[email protected]>
> wrote:
>
>> On the surface it adds an additional level of specification/grouping.
>>
>> The potential benefit we have in Accumulo is that, along with the fact
>> that identical row IDs are guaranteed to be in the same file, you can use
>> locality groups to place specific column families into the same file as
>> well, providing faster scans when looking for a specific column family.
>>
>> On Wed, May 27, 2015 at 9:05 AM, David Patterson <[email protected]>
>> wrote:
>>
>>> I've been trying to understand the difference between the two column
>>> name parts -- column family and column qualifier. I don't understand the
>>> value of using the columnFamily for the column name and an "empty text"
>>> (new Text(new byte[0])) field for the column qualifier vs. a non-unique
>>> column name and the distinct column name in the column qualifier position.
>>>
>>> I can sort-of understand the distinction if I have multiple distinct
>>> kinds of data in my data collection. I could use the column family part to
>>> determine how to interpret the rest of the data (what columns I can expect,
>>> etc.). But, that kind of data could also be handled with multiple databases.
>>>
>>> Any guidance would be appreciated.
>>>
>>> Thanks.
>>>
>>> Davie Patterson
>>>
>>
>> --
>> *Andrew George Wells*
>> *Software Engineer*
>> *[email protected] <[email protected]>*
>>
>
