On Thu, Feb 2, 2012 at 4:47 PM, Bryan Beaudreault <[email protected]> wrote:
> I'd love to hear from an expert on the pros and cons of big tables vs many
> tables, when access patterns and simplicity are not a concern[1]. I
> haven't found much information regarding it, but I'd imagine the only
> benefit to many tables is the ability to configure each differently if that
> is helpful for the use case.

HBase doesn't offer a whole lot of configuration knobs per table. Most
tables I come across have the same configuration: a single family, LZO
compression, some form of Bloom filter. Maybe VERSIONS=>1. If you need
different configs, you can also consider using multiple column families
in a single table.

If you have somewhat related data and you're on the fence about whether
to store everything in a single table, I generally recommend sticking to
a single table. From an operational standpoint, it's easier to manage a
single table for an application than multiple ones. You also generally
end up with fewer, bigger regions, which is almost always better. This
means that your RegionServers are writing more data to fewer WALs, which
leads to more sequential writes across the board. You'll end up with
fewer HLogs, which is also a good thing.

As others said, with a single table design, you can control data
locality, but as soon as you write to and read from multiple tables, all
bets are off.

If you use HBase's client (which is most likely the case, as the only
alternative is asynchbase), beware that you need to create one HTable
instance per table per thread in your application code. If you build an
application with many tables, this rapidly becomes unwieldy. If you use
asynchbase you don't have this problem, because it uses a single
HBaseClient object for your entire cluster, and it's thread-safe.

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com
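To make the "one HTable instance per table per thread" point above concrete, here is a minimal sketch of how the instance count grows. It does not use the HBase API at all: TableHandle is a hypothetical stand-in for HTable that only counts how many objects get created, and PerThreadTables, countHandles and the table_N names are likewise invented for illustration. The per-thread cache shown is one common way to satisfy the constraint, not necessarily how any given application does it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an HBase HTable; it only counts creations.
final class TableHandle {
    static final AtomicInteger CREATED = new AtomicInteger();
    final String tableName;

    TableHandle(String tableName) {
        this.tableName = tableName;
        CREATED.incrementAndGet();
    }
}

final class PerThreadTables {
    // Each thread gets its own private map of table name -> handle,
    // so no handle is ever shared between threads.
    private static final ThreadLocal<Map<String, TableHandle>> HANDLES =
        ThreadLocal.withInitial(HashMap::new);

    static TableHandle get(String tableName) {
        return HANDLES.get().computeIfAbsent(tableName, TableHandle::new);
    }

    // Starts `threads` workers that each touch `tables` tables twice,
    // then returns how many handles were created in total.
    static int countHandles(int threads, int tables) {
        TableHandle.CREATED.set(0);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int pass = 0; pass < 2; pass++) {
                    for (int t = 0; t < tables; t++) {
                        get("table_" + t); // second pass reuses the cached handle
                    }
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return TableHandle.CREATED.get();
    }

    public static void main(String[] args) {
        System.out.println(countHandles(8, 1));  // prints 8
        System.out.println(countHandles(8, 20)); // prints 160
    }
}
```

With 8 threads, a single-table design needs 8 handles, while a 20-table design needs 160: the count is always threads times tables, which is the growth the message warns about.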
