On Thu, Jul 12, 2018 at 4:31 AM Lars Francke <lars.fran...@gmail.com> wrote:
>
> I've got a question on the number of column families. I've told everyone
> for years that you shouldn't use more than maybe 3-10 column families.
>
> Our book still says the following:
> "HBase currently does not do well with anything above two or three column
> families so keep the number of column families in your schema low.
> Currently, *flushing* and compactions are done on a per Region basis so if
> one column family is carrying the bulk of the data bringing on flushes, the
> adjacent families will also be flushed even though the amount of data they
> carry is small."
>
> I'm wondering what the state of the art _really_ is today.
>
> I know that flushing happens per CF.

Yes.


> As far as I can tell though
> compactions still happen for all stores in a region after a flush.
>
> Related question there (there's always a good chance that I misread the
> code): Wouldn't it make sense to make the compaction decision after a flush
> also per Store?
>

Yes.

We compact a CF at a time (going by the logs). CompactionRequest is CF
scoped. You reckon we do the full Region, Lars? (I've not dug in.)


> But back to the original question. How many column families do you see
> and/or use in production? And what are the remaining reasons against "a
> lot"?
>

I think 3-10 is fine as a general recommendation. Perhaps add the caveat
that more is also possible but queries should be CF scoped, and outline
what happens on full-row fetches, especially if the character of the
data in each CF varies radically; e.g. one CF has images while another
has metadata.
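
To make the caveat concrete, here is a toy sketch in plain Python. It is
not HBase client code: the row layout, the family names ("img", "meta")
and the bytes_read helper are all made up for illustration. In the real
Java client the scoping would be done with Get.addFamily or
Scan.addFamily. It only shows how much more data a full-row fetch touches
when one CF holds large values.

```python
# Toy in-memory row: {family: {qualifier: value_bytes}}.
# Hypothetical layout for illustration only, not how HBase stores data.
row = {
    "img": {"raw": b"\x00" * 2_000_000},  # one ~2 MB image cell
    "meta": {"name": b"cat.jpg", "type": b"image/jpeg"},
}


def bytes_read(row, families=None):
    """Sum the value bytes a read touches.

    families=None models a full-row fetch; a list models a CF-scoped read.
    """
    selected = row if families is None else {f: row[f] for f in families}
    return sum(len(v) for cells in selected.values() for v in cells.values())


full_row = bytes_read(row)             # touches every CF, image included
meta_only = bytes_read(row, ["meta"])  # CF-scoped: metadata cells only
print(full_row, meta_only)
```

A metadata lookup that is not CF scoped pays for the image bytes too,
which is the cost the caveat is meant to flag.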


> My list is the following:
> - Splits happen per region, so small CFs will be split to be even smaller
> - Each CF takes up a few resources even if they are not in use (no reads or
> writes)
> - If each CF is used then there is an increased total memory pressure which
> will probably lead to early flushes which leads to smaller files which
> leads to more compactions etc.
> - As far as I can tell (but I'm not sure) when a single Store/CF answers
> "yes" to the "needsCompaction()" call after a flush the whole region will
> be compacted

We need to answer this question. I spent five minutes looking in the logs
and compactions look to run per CF. Looking in the code, I see that we
generally compact by CF, but there is a top-level method that does all
CFs, seemingly used from tests.

What are you seeing, Lars?

If we compact all CFs when a compaction runs, that's a bug.
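
To pin down the two behaviours in question, a toy model in plain Python
may help. None of this is HBase code: the Store class, the file-count
threshold of 3 and both function names are assumptions made for
illustration. The first function is what the logs suggest we do; the
second is the behaviour that would be a bug.

```python
from dataclasses import dataclass


@dataclass
class Store:
    """Toy stand-in for one column family's store in a region."""
    family: str
    store_files: int

    def needs_compaction(self, min_files=3):
        # Assumed rule: compact once enough files have piled up.
        return self.store_files >= min_files


def per_store_requests(stores):
    """Per-CF behaviour: request compaction only for stores that ask."""
    return [s.family for s in stores if s.needs_compaction()]


def whole_region_requests(stores):
    """Suspected bug: one store saying yes drags in every store."""
    if any(s.needs_compaction() for s in stores):
        return [s.family for s in stores]
    return []


stores = [Store("img", 5), Store("meta", 1)]
print(per_store_requests(stores))     # only the busy CF
print(whole_region_requests(stores))  # every CF in the region
```

With many small CFs next to one busy CF, the second variant would rewrite
the small stores for no benefit, which is why the distinction matters for
the column family recommendation.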

Thanks,
S



> - Each CF creates a directory + files per region -> might lead to lots of
> small files
>

This is done lazily.

> I'd love to update the book when I have some answers.
>

Thanks Lars,
S


> Thank you!
>
> Cheers,
> Lars
