How about Column Families? We have 4 column families per table due to different settings (versions, etc.). They are sparse in that a given row will only ever write to a single CF, and even regions usually have only one CF's data/store file, except at the border between row-key naming conventions (each CF has its own convention). I recently read in the online book (see below) that more CFs are bad and you should stick with only one. Is this true given that there is really only ever data for one CF in a given region? Are we wasting disk I/O and memory because of empty CFs being flushed and compacted?
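For concreteness, our layout is roughly the sketch below (hypothetical table and family names, made-up versions values, using the 0.90-era Java client API; the real schema differs in the details):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateMultiCfTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Four families, each with its own versions setting; any given
        // row only ever writes to one of them, so the other three stay
        // empty for that row (and usually for the whole region).
        HTableDescriptor desc = new HTableDescriptor("events");
        String[] families = {"raw", "hourly", "daily", "meta"};
        int[] versions = {1, 3, 10, 1};
        for (int i = 0; i < families.length; i++) {
          HColumnDescriptor cf = new HColumnDescriptor(families[i]);
          cf.setMaxVersions(versions[i]);
          desc.addFamily(cf);
        }
        admin.createTable(desc);
      }
    }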
Thanks as always, Stack, for your help.

8.2. On the number of column families

HBase currently does not do well with anything above two or three column families, so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per-region basis, so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed even though the amount of data they carry is small. Compaction is currently triggered by the total number of files under a column family; it is not size-based. When there are many column families, the flushing and compaction interaction can make for a bunch of needless i/o loading (to be addressed by changing flushing and compaction to work on a per-column-family basis). Try to make do with one column family if you can in your schemas. Only introduce a second and third column family in the case where data access is usually column scoped; i.e., you query one column family or the other, but usually not both at the same time.

On Wed, May 18, 2011 at 10:46 AM, Stack <[email protected]> wrote:
> It's not the number of tables that is of import, it's the number of
> regions. You can have your regions in as many tables as you like. I
> do not believe there is a cost to having more tables.
>
> St.Ack
>
> On Wed, May 18, 2011 at 5:54 AM, Wayne <[email protected]> wrote:
> > How many tables can a cluster realistically handle, or how many
> > tables/node can be supported? I am looking for a realistic idea of
> > whether a 10-node cluster can support 100 or even 500 tables. I
> > realize it is recommended to have a few tables at most (and to use
> > the row key to add everything to one table), but that is not an
> > option for us at this point. What are the settings that need to be
> > tweaked, and where are the issues going to occur in terms of
> > resource limitations, memory constraints, and OOM problems? Do most
> > resource limitations fall back to total active region count
> > regardless of the table count? Where do things get scary in terms
> > of a large number of tables?
> >
> > Thanks in advance for any advice that can be provided.
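Following the book's advice quoted above, a minimal single-family variant might look like the sketch below (again hypothetical names). One caveat: VERSIONS is a per-family setting, so collapsing to one family means one retention policy for the whole table unless that is handled at the application level.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SingleCfSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // One family; the old family name moves into the row key
        // prefix, which the existing per-CF key conventions already
        // effectively encode.
        HTableDescriptor desc = new HTableDescriptor("events_single");
        HColumnDescriptor d = new HColumnDescriptor("d");
        d.setMaxVersions(3); // single retention policy for the table
        desc.addFamily(d);
        admin.createTable(desc);

        // Writes carry the former family name as a key prefix instead.
        HTable table = new HTable(conf, "events_single");
        Put put = new Put(Bytes.toBytes("hourly|2011-05-18T10|sensor42"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("value"),
            Bytes.toBytes("payload"));
        table.put(put);
        table.close();
      }
    }

The row-key prefix plays the role the separate families played, so a read scoped to one former family becomes a prefix scan, and flushes/compactions only ever touch the one store that actually holds data.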
