I've got a question on the number of column families. I've told everyone
for years that you shouldn't use more than maybe 3-10 column families.

Our book still says the following:
"HBase currently does not do well with anything above two or three column
families so keep the number of column families in your schema low.
Currently, *flushing* and compactions are done on a per Region basis so if
one column family is carrying the bulk of the data bringing on flushes, the
adjacent families will also be flushed even though the amount of data they
carry is small."

I'm wondering what the state of the art _really_ is today.

I know that flushing happens per CF. As far as I can tell, though,
compactions are still triggered for all stores in a region after a flush.

Related question (there's always a good chance that I misread the code):
wouldn't it make sense to make the compaction decision after a flush per
Store as well?
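
To make the question concrete, here is a toy model of the two behaviours being compared. It is only a sketch and not HBase code: the `Store` class, the `min_files` threshold of 3, and both selection functions are hypothetical names made up for illustration. The "region-wide" variant reflects my reading of the code, which may be wrong.

```python
from dataclasses import dataclass


@dataclass
class Store:
    """Hypothetical stand-in for one column family's store in a region."""
    name: str
    file_count: int

    def needs_compaction(self, min_files: int = 3) -> bool:
        # Assumed rule: a store asks for compaction once it holds
        # at least min_files store files.
        return self.file_count >= min_files


def region_wide_selection(stores):
    """One 'yes' from any store queues every store in the region."""
    if any(s.needs_compaction() for s in stores):
        return [s.name for s in stores]
    return []


def per_store_selection(stores):
    """Alternative: queue only the stores that ask for compaction."""
    return [s.name for s in stores if s.needs_compaction()]


stores = [Store("big", 5), Store("small_a", 1), Store("small_b", 1)]
print(region_wide_selection(stores))  # ['big', 'small_a', 'small_b']
print(per_store_selection(stores))    # ['big']
```

In this model the two small families are queued along with the big one only in the region-wide variant, which is the extra work the question is about.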

But back to the original question. How many column families do you see
and/or use in production? And what are the remaining reasons against "a
lot"?

My list is the following:
- Splits happen per region, so small CFs will be split to be even smaller
- Each CF takes up some resources even if it is not in use (no reads or
writes)
- If every CF is in use, total memory pressure increases, which will
probably lead to earlier flushes, which lead to smaller files, which
lead to more compactions, etc.
- As far as I can tell (but I'm not sure), when a single Store/CF answers
"yes" to the "needsCompaction()" call after a flush, the whole region will
be compacted
- Each CF creates a directory + files per region -> might lead to lots of
small files
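
A back-of-the-envelope sketch of the memory-pressure and small-files points above. It assumes a fixed memstore budget per region (128 MB is used purely as an illustrative figure), writes spread evenly across all CFs, and one directory per CF per region. The `estimate` function and its parameters are hypothetical, and real workloads will be far less uniform.

```python
def estimate(regions, families, flush_size_mb=128.0):
    """Rough per-table estimate under the assumptions stated above."""
    # One store directory per column family in every region.
    store_dirs = regions * families
    # If the region's memstore budget is shared evenly, each CF
    # contributes roughly this much data to its flushed file.
    avg_flush_file_mb = flush_size_mb / families
    return store_dirs, avg_flush_file_mb


for families in (1, 3, 10, 50):
    dirs, mb = estimate(regions=1000, families=families)
    print(f"{families:>3} CFs: {dirs:>6} store directories, "
          f"~{mb:.1f} MB per flushed file")
```

Under these assumptions the directory count grows linearly with the number of CFs, while the average flushed file shrinks in proportion.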

I'd love to update the book when I have some answers.

Thank you!

Cheers,
Lars
