Yes, you will be wasting some IO. This is a well-known issue in HBase, but it's not because empty families get flushed; in HBase, if something is empty it usually means it doesn't exist (that's why sparse columns are free). The problem is that if you insert into 4 families in different rows but all in the same region, the flush is triggered on the aggregate size of all the families instead of each family being flushed individually. Say you load them unevenly: you could end up with 3 files of a few KBs and one big 63MB file. Repeat that a few times, and you'll be compacting those small files with other small files until you get bigger ones, and then you'll still be compacting them with small files. That's where the waste is; you want to flush/compact as rarely as possible.
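The per-region flush behavior described above can be sketched as a toy simulation (the numbers and the `write` helper are hypothetical; HBase's real memstore accounting is more involved): the flush trigger looks at the region's aggregate memstore size, and when it fires, every family in the region gets a store file, however little data it holds.

```python
# Toy model of per-region flushing (not per-family): crossing the region's
# aggregate memstore limit flushes ALL families, producing tiny store files
# for the lightly loaded ones.

FLUSH_THRESHOLD = 64 * 1024 * 1024  # hypothetical 64 MB region memstore limit

def write(region, family, nbytes):
    """Buffer an edit; when the region total crosses the limit, flush every family."""
    region[family] = region.get(family, 0) + nbytes
    flushed = {}
    if sum(region.values()) >= FLUSH_THRESHOLD:
        flushed = dict(region)  # each family gets a store file, even the tiny ones
        region.clear()
    return flushed

region = {}
# Load family 'a' heavily and 'b', 'c', 'd' lightly, as in the example above.
for _ in range(63):
    write(region, 'a', 1024 * 1024)       # 63 MB into 'a'
for fam in ('b', 'c', 'd'):
    write(region, fam, 16 * 1024)         # a few KB into each other family
files = write(region, 'a', 1024 * 1024)   # this write trips the region limit
# Result: one ~64 MB file for 'a' plus three 16 KB files for 'b', 'c', 'd' --
# those small files are what compactions then keep rewriting.
```

Running this, `files` holds one large entry for `'a'` and three 16 KB entries, which is exactly the "3 files of a few KBs and a big file" pattern that drives the wasted compaction IO.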
J-D

On Thu, May 19, 2011 at 8:25 AM, Wayne <[email protected]> wrote:
> How about Column Families? We have 4 column families per table due to
> different settings (versions etc.). They are sparse in that a given row will
> only ever write to a single CF, and even regions usually have only 1 CF's
> data/store file except at the border between row key naming conventions
> (each CF has its own convention). I recently read in the online book (see
> below) how more CFs are bad and you should stick with only 1. Is this true
> given that there is only ever really data for one CF in a given region? Are
> we wasting disk i/o and memory because of empty CFs being flushed and
> compacted?
>
> Thanks as always, Stack, for your help.
>
> 8.2. On the number of column families
>
> HBase currently does not do well with anything above two or three column
> families, so keep the number of column families in your schema low.
> Currently, flushing and compactions are done on a per-region basis, so if one
> column family is carrying the bulk of the data bringing on flushes, the
> adjacent families will also be flushed even though the amount of data they
> carry is small. Compaction is currently triggered by the total number of
> files under a column family; it's not size based. With many column families,
> the flushing and compaction interaction can make for a bunch of needless i/o
> loading (to be addressed by changing flushing and compaction to work on a
> per-column-family basis).
>
> Try to make do with one column family if you can in your schemas. Only
> introduce a second and third column family in the case where data access is
> usually column scoped; i.e. you query one column family or the other but
> usually not both at the one time.
>
> On Wed, May 18, 2011 at 10:46 AM, Stack <[email protected]> wrote:
>> It's not the number of tables that is of import, it's the number of
>> regions. You can have your regions in as many tables as you like. I
>> do not believe there is a cost to having more tables.
>>
>> St.Ack
>>
>> On Wed, May 18, 2011 at 5:54 AM, Wayne <[email protected]> wrote:
>> > How many tables can a cluster realistically handle, or how many
>> > tables/node can be supported? I am looking for a realistic idea of
>> > whether a 10 node cluster can support 100 or even 500 tables. I realize
>> > it is recommended to have a few tables at most (and to use the row key
>> > to add everything to one table), but that is not an option for us at
>> > this point. What are the settings that need to be tweaked, and where are
>> > the issues going to occur in terms of resource limitations, memory
>> > constraints, and OOM problems? Do most resource limitations fall back to
>> > total active region count regardless of the table count? Where do things
>> > get scary in terms of a large number of tables?
>> >
>> > Thanks in advance for any advice that can be provided.
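For context on the "4 column families due to different settings" situation Wayne describes, per-family settings like `VERSIONS` are declared at table-creation time. A minimal HBase shell sketch (table and family names here are made up for illustration):

```
# Hypothetical table with two families kept separate only because they
# need different VERSIONS settings -- the pattern discussed in this thread.
create 'mytable', {NAME => 'a', VERSIONS => 1}, {NAME => 'b', VERSIONS => 5}
```

Given the flush/compaction interaction described above, it is worth weighing whether the differing settings are worth the extra families, or whether one family with the larger `VERSIONS` value would do.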
