Re: Question about best practices on column names

David Patterson Wed, 27 May 2015 08:06:18 -0700

Thank you to all responders. This clears it up greatly.

Dave P


On Wed, May 27, 2015 at 10:52 AM, Christopher <[email protected]> wrote:

> David-
>
> Both the column family (CF) and column qualifier (CQ) could be thought of
> as arbitrary dimensions in the key. If you only need one dimension to
> specify your data, the other can be empty. You could also store these in
> separate tables, as you suggest, but part of the power of Accumulo is that
> you don't actually need to separate your data this way. You can keep it in
> the same table, organized by CF and, as Andrew alludes to, you can store
> specific CFs in a particular locality group for faster access when querying
> just data in those CFs.
>
> If you have only one category of data, I'd recommend storing the specifier
> in the CQ, not the CF. Although they look like they would be equivalent for
> this case, you're more resilient to future changes if you use the CQ,
> because now you can reserve the CF for later changes to the schema you're
> using or for another kind of data you want to mix in later.
>
> In addition, I would recommend using the CF to store data from a finite
> set (such as "one of {STRING, DATE, INT}" or "one of {CategoryA,
> CategoryB}" or "one of {Schema1, Schema2}", etc.), while you can use the CQ
> to store arbitrary data (such as "<date>", "<number>", "<name>", etc.). The
> reason for this is that locality groups, should you ever decide to use them
> at some point in the future, can only (currently) be specified as a finite
> discrete set of CFs, and not a pattern or other predicate. So, not storing
> arbitrary data in the CFs will leave that option available to you.
>
> Basically, you are right that you can use either or both for your column
> names, but there are a few good practices which might help you decide which
> is better to use for your data.
>
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
> On Wed, May 27, 2015 at 9:17 AM, Andrew Wells <[email protected]>
> wrote:
>
>> On the surface it adds an additional level of specification/grouping.
>>
>> The potential benefit we have in Accumulo is that, along with the fact
>> that identical row IDs are guaranteed to be in the same file, you can use
>> locality groups to place specific column families into the same file as
>> well, providing faster scans when looking for a specific column family.
>>
>>
>>
>> On Wed, May 27, 2015 at 9:05 AM, David Patterson <[email protected]>
>> wrote:
>>
>>> I've been trying to understand the difference between the two column
>>> name parts -- column family and column qualifier. I don't understand the
>>> value of using the columnFamily for the column name and an "empty text"
>>> (new Text(new byte[0])) field for the column qualifier vs. a non-unique
>>> column name and the distinct column name in the column qualifier position.
>>>
>>>
>>> I can sort-of understand the distinction if I have multiple distinct
>>> kinds of data in my data collection. I could use the column family part to
>>> determine how to interpret the rest of the data (what columns I can expect,
>>> etc.). But, that kind of data could also be handled with multiple databases.
>>>
>>> Any guidance would be appreciated.
>>>
>>> Thanks.
>>>
>>> David Patterson
>>>
>>
>>
>>
>> --
>> *Andrew George Wells*
>> *Software Engineer*
>> *[email protected] <[email protected]>*
>>
>>
>
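
For readers who want the advice in this thread in concrete form, here is a minimal sketch in plain Java. It is a model only and does not use the Accumulo client API: the class ColumnLayoutSketch, the Key class, the sample rows, and the group names "dates" and "numbers" are all hypothetical. It assumes entries sort by row, then CF, then CQ (ignoring visibility and timestamp, and using Java string order rather than byte order), and that a locality group is an explicit finite set of CFs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java model of the layout discussed in the thread; NOT the Accumulo client API.
final class ColumnLayoutSketch {

    // Hypothetical stand-in for a key: sorts by row, then CF, then CQ.
    static final class Key implements Comparable<Key> {
        final String row;
        final String cf;
        final String cq;

        Key(String row, String cf, String cq) {
            this.row = row;
            this.cf = cf;
            this.cq = cq;
        }

        @Override
        public int compareTo(Key other) {
            int c = row.compareTo(other.row);
            if (c == 0) c = cf.compareTo(other.cf);
            if (c == 0) c = cq.compareTo(other.cq);
            return c;
        }
    }

    // Hypothetical locality-group configuration: each group lists its CFs explicitly.
    // This only works because the CFs come from a small, known set.
    static final Map<String, Set<String>> LOCALITY_GROUPS = Map.of(
            "dates", Set.of("DATE"),
            "numbers", Set.of("INT"));

    static TreeMap<Key, String> sampleTable() {
        TreeMap<Key, String> table = new TreeMap<>();
        // CF holds a value from a finite set; CQ holds the arbitrary specifier.
        table.put(new Key("row1", "DATE", "created"), "2015-05-27");
        table.put(new Key("row1", "DATE", "modified"), "2015-05-28");
        table.put(new Key("row1", "INT", "size"), "42");
        table.put(new Key("row1", "STRING", "owner"), "dave");
        return table;
    }

    // Collect the qualifiers stored under one CF of one row, in sorted order.
    static List<String> qualifiers(TreeMap<Key, String> table, String row, String cf) {
        List<String> result = new ArrayList<>();
        // Keys for (row, cf) are contiguous, so start at the smallest possible CQ
        // and stop as soon as the row or CF changes.
        for (Key k : table.tailMap(new Key(row, cf, ""), true).keySet()) {
            if (!k.row.equals(row) || !k.cf.equals(cf)) {
                break;
            }
            result.add(k.cq);
        }
        return result;
    }

    // Name of the group that lists this CF, or "default" if no group lists it.
    static String groupOf(String cf) {
        for (Map.Entry<String, Set<String>> group : LOCALITY_GROUPS.entrySet()) {
            if (group.getValue().contains(cf)) {
                return group.getKey();
            }
        }
        return "default";
    }

    public static void main(String[] args) {
        TreeMap<Key, String> table = sampleTable();
        System.out.println(qualifiers(table, "row1", "DATE"));
        System.out.println(groupOf("INT"));
        System.out.println(groupOf("STRING"));
    }
}
```

Had the dates or names been stored in the CF instead, the group configuration above would need an entry per distinct value, which is the situation Christopher's advice is meant to avoid.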
