Couple of clarifications:

* Identical rowIDs will colocate data in the same tablet, but not necessarily the same file. Tablets can have multiple files.

* Locality groups will colocate data within a file, but not necessarily in a file of its own. The RFile format supports multiple "regions" within a file, each of which corresponds to a locality group.
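
To make the two points above a bit more concrete, here is a toy Python model of a single file split into locality-group "regions". This is only a sketch of the idea, not Accumulo or RFile code: the function names, the "default" label, and the sample data are all made up for illustration.

```python
from collections import defaultdict

DEFAULT_GROUP = "default"  # hypothetical label for the default locality group


def build_file(entries, locality_groups):
    """Split sorted (row, family, qualifier, value) entries into regions of ONE file.

    locality_groups maps a group name to the set of column families it holds.
    Families not named in any group fall into the default group.
    """
    family_to_group = {
        family: group
        for group, families in locality_groups.items()
        for family in families
    }
    regions = defaultdict(list)
    for entry in sorted(entries):
        family = entry[1]
        regions[family_to_group.get(family, DEFAULT_GROUP)].append(entry)
    return dict(regions)


def scan_family(regions, locality_groups, family):
    """Return (matching entries, number of entries read) for one column family."""
    group = next(
        (g for g, fams in locality_groups.items() if family in fams),
        DEFAULT_GROUP,
    )
    region = regions.get(group, [])
    hits = [e for e in region if e[1] == family]
    return hits, len(region)


entries = [
    ("row1", "meta", "size", "42"),
    ("row1", "content", "body", "hello"),
    ("row2", "meta", "size", "7"),
    ("row2", "content", "body", "world"),
    ("row2", "tags", "color", "red"),
]
groups = {"small": {"meta"}}  # one configured group; everything else is default

regions = build_file(entries, groups)
hits, read = scan_family(regions, groups, "meta")
print(hits, read)
```

In this model a scan for the "meta" family only touches the region of its own group, while a scan for "tags" has to read through the default region it shares with "content". Both regions still live in the same file.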

To David's original question, I like to think of the family/qualifier breakdown in the general case as follows: the family is used for a coarse grouping of similar data while the qualifier is used as some name/identifier for the value.
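
A small Python sketch of that breakdown, using a plain dict keyed by (row, family, qualifier) in place of a real table. The row, families, qualifiers, and values here are invented for illustration, and `fetch_family` is a hypothetical helper, not an Accumulo call.

```python
def fetch_family(cells, row, family):
    """Return {qualifier: value} for one row and one column family."""
    return {
        qualifier: value
        for (r, f, qualifier), value in sorted(cells.items())
        if r == row and f == family
    }


# The family is the coarse grouping; the qualifier names the individual value.
cells = {
    ("user123", "contact", "email"): "someone@example.org",
    ("user123", "contact", "phone"): "555-0100",
    ("user123", "prefs", "theme"): "dark",
}

contact = fetch_family(cells, "user123", "contact")
print(contact)
```

With the column name in the family and an empty qualifier instead, each of these values would sit in its own family, so pulling "all contact fields" would mean listing every column name rather than asking for one family.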

Accumulo's flexibility in how the data model is implemented (specifically, the ability to store any column family in a table via the default locality group) lets you implement much more advanced "schemas", but the above is definitely the "typical" case if you look at "BigTable" use in general, IMO.

Andrew Wells wrote:
On the surface it adds an additional level of specification/grouping.

The potential benefit we have in Accumulo is that, along with the fact
that identical rowIDs are guaranteed to be in the same file, you can
use Locality Groups to place specific Column Families into the same
file as well, providing faster scans when looking for a specific column
family.



On Wed, May 27, 2015 at 9:05 AM, David Patterson wrote:

    I've been trying to understand the difference between the two column
    name parts -- column family and column qualifier. I don't understand
    the value of using the columnFamily for the column name and an
    "empty text" (new Text(new byte[0])) field for the column qualifier
    vs. a non-unique column name and the distinct column name in the
    column qualifier position.


    I can sort-of understand the distinction if I have multiple distinct
    kinds of data in my data collection. I could use the column family
    part to determine how to interpret the rest of the data (what
    columns I can expect, etc.). But, that kind of data could also be
    handled with multiple databases.

    Any guidance would be appreciated.

    Thanks.

    David Patterson




--
*Andrew George Wells*
*Software Engineer*
