On Tue, Nov 6, 2012 at 11:01 AM, Sukant Hajra <[email protected]> wrote:
> I've been trying to understand Accumulo more deeply as we use it more. To
> supplement the on-line documentation and source, I've been referencing some
> blog articles on HBase (Lars George has some good ones), HBase docs, and the
> BigTable paper.
>
> But I'm curious about some of the deviations of Accumulo from BigTable and
> HBase.
>
> The questions I have right now are:
>
> 1. Is the format of an RFile close to HFile version 1, HFile version 2, or
>    at this point is the format really its own thing? I found good
>    documentation on the HFile, but I haven't yet found similar documentation
>    on RFiles. There's the source code, but I haven't dug into that yet.
>

I think there is a different HFile for each column family, isn't there? An
RFile stores all columns and all locality groups in a single file, which is
another reason you don't get the same performance penalty for having lots of
column families in Accumulo.

> 2. I understand that HBase doesn't do well with too many column families.
>    However, creating too many column families in HBase isn't likely anyway
>    because you can't (I believe) create them dynamically. Accumulo allows you
>    to create column families dynamically. But I wonder if this can come at a
>    cost. Is there a benefit to using column families less frequently if
>    possible in Accumulo? Or is the cost of using column families more or less
>    the same as using column qualifiers?
>
> 3. I guess one way families might be different from qualifiers relates to
>    HBase's recommendation to keep column family names short to avoid needless
>    storage waste. That should apply to Accumulo as well, right?
>
> 4. In supporting dynamic column families, was there a design trade-off with
>    respect to the original BigTable or current HBase design? What might be a
>    benefit of doing it the other way?
>

The main thing Accumulo had to do differently from BigTable to allow dynamic
creation of column families was to create a default locality group. That's
the locality group that stores column families that aren't specified for any
other locality group. I recall Keith saying it was kind of a pain to
implement, but I don't see any obvious negative tradeoffs of the design.

Billie

> Thanks,
> Sukant
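
To picture the default locality group mentioned in the reply, here is a toy
model in Python. It is only a sketch of the idea and not Accumulo's
implementation: the class ToyTable, the group names, and the "<default>"
label are all made up for illustration. Column families that were assigned to
a locality group go to that group, and any family that was never declared
falls through to the default group, which is what lets new families appear
without any configuration.

```python
from collections import defaultdict

# Hypothetical label for the catch-all group; not an Accumulo identifier.
DEFAULT_GROUP = "<default>"


class ToyTable:
    """Toy model of column families partitioned into locality groups."""

    def __init__(self, locality_groups=None):
        # Map each explicitly configured column family to its group name.
        self.family_to_group = {}
        for group, families in (locality_groups or {}).items():
            for family in families:
                self.family_to_group[family] = group
        # Group name -> list of (row, family, qualifier, value) entries.
        self.groups = defaultdict(list)

    def group_for(self, family):
        # Families not assigned to any group land in the default group.
        return self.family_to_group.get(family, DEFAULT_GROUP)

    def put(self, row, family, qualifier, value):
        entry = (row, family, qualifier, value)
        self.groups[self.group_for(family)].append(entry)


table = ToyTable({"metadata": {"meta"}, "content": {"body"}})
table.put("row1", "meta", "author", "sukant")
table.put("row1", "body", "text", "hello")
# "tags" was never declared, so it is created dynamically in the default group.
table.put("row1", "tags", "topic", "accumulo")
```

In this model, writing to an undeclared family needs no schema change; the
only extra machinery is the fallback lookup in group_for.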
