On Thu, Jul 12, 2018 at 4:31 AM Lars Francke <lars.fran...@gmail.com> wrote:

> I've got a question on the number of column families. I've told everyone
> for years that you shouldn't use more than maybe 3-10 column families.
>
> Our book still says the following:
> "HBase currently does not do well with anything above two or three column
> families so keep the number of column families in your schema low.
> Currently, *flushing* and compactions are done on a per Region basis so if
> one column family is carrying the bulk of the data bringing on flushes, the
> adjacent families will also be flushed even though the amount of data they
> carry is small."
>
> I'm wondering what the state of the art _really_ is today.
>
> I know that flushing happens per CF.

Yes.

> As far as I can tell though
> compactions still happen for all stores in a region after a flush.
>
> Related question there (there's always a good chance that I misread the
> code): Wouldn't it make sense to make the compaction decision after a flush
> also per Store?

Yes. We compact a CF at a time (looking in logs). CompactionRequest is CF
scoped. You reckon we do the full Region, Lars? (I've not dug in.)

> But back to the original question. How many column families do you see
> and/or use in production? And what are the remaining reasons against "a
> lot"?

I think the 3-10 is fine as a general recommendation. Perhaps caveat that
more is also possible but queries should be CF scoped, outlining what
happens on full-row fetches, especially if the character of the data in
each CF varies radically; e.g. one CF has images, while another has
metadata.

> My list is the following:
> - Splits happen per region, so small CFs will be split to be even smaller
> - Each CF takes up a few resources even if they are not in use (no reads or
> writes)
> - If each CF is used then there is an increased total memory pressure which
> will probably lead to early flushes which leads to smaller files which
> leads to more compactions etc.
> - As far as I can tell (but I'm not sure) when a single Store/CF answers
> "yes" to the "needsCompaction()" call after a flush the whole region will
> be compacted

We need to answer this question. I spent five minutes looking in logs and
they look to run per-CF. Looking in code, I see generally that we do it by
CF, but there is a top-level method that does all CFs... used from tests,
seemingly. What are you seeing, Lars? If we compact all CFs when a
compaction runs, that's a bug.

Thanks,
S

> - Each CF creates a directory + files per region -> might lead to lots of
> small files

This is done lazily.

> I'd love to update the book when I have some answers.

Thanks Lars,
S

> Thank you!
>
> Cheers,
> Lars
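
To make the open question above concrete (is the compaction decision after a flush made per Store, or for the whole Region?), here is a toy model in Python. It is not HBase code: the `Store` class, the `MIN_FILES_TO_COMPACT` threshold, and both `compact_*` functions are hypothetical names made up for illustration. It only shows how much extra rewriting a small CF would see if one busy CF triggered compaction for every store in the region.

```python
from dataclasses import dataclass, field

# Hypothetical threshold for this toy model only.
MIN_FILES_TO_COMPACT = 3


@dataclass
class Store:
    """Toy stand-in for one column family's store within a region."""
    name: str
    memstore_bytes: int = 0
    files: list = field(default_factory=list)  # sizes of flushed files

    def flush(self):
        # Write the memstore out as one new file, if there is anything in it.
        if self.memstore_bytes > 0:
            self.files.append(self.memstore_bytes)
            self.memstore_bytes = 0

    def needs_compaction(self):
        return len(self.files) >= MIN_FILES_TO_COMPACT

    def compact(self):
        # Merge all files into one and report how many bytes were rewritten.
        rewritten = sum(self.files)
        self.files = [rewritten]
        return rewritten


def compact_per_store(stores):
    """Compact only the stores that ask for it."""
    return {s.name: s.compact() for s in stores if s.needs_compaction()}


def compact_whole_region(stores):
    """Compact every store as soon as any single store asks for it."""
    if any(s.needs_compaction() for s in stores):
        return {s.name: s.compact() for s in stores}
    return {}


def make_region():
    # One CF carrying the bulk of the data, one small CF, both about to flush.
    stores = [
        Store("image", memstore_bytes=90, files=[100, 120]),
        Store("meta", memstore_bytes=2),
    ]
    for s in stores:
        s.flush()
    return stores


print(compact_per_store(make_region()))
print(compact_whole_region(make_region()))
```

In the per-store variant only the "image" store is rewritten; in the whole-region variant the tiny "meta" store is rewritten as well, which is the behaviour described above as a bug if it happens in practice.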