On Sat, May 14, 2011 at 6:40 AM, Thibault Dory wrote:
> I'm wondering what are the possible bottlenecks of an HBase cluster, even if
> there are cache mechanism, the fact that some data are centralized could
> lead to a bottleneck (even if its quite theoretical given the load needed to
> achieve it).

Isn't that what your paper is about?

> Would it be right to say the following ?
>
>    - The namenode is storing all the meta data and must scale vertically if
>    the cluster becomes very big

Having only 1 namenode is bad in multiple ways; generally people are more bothered by the fact that it's a single point of failure. Larger companies do hit the limits of that single machine, so Y! worked on "Federated Namenodes" as a way to circumvent that. See http://www.slideshare.net/huguk/hdfs-federation-hadoop-summit2011

This work is already available in Hadoop's svn trunk.

>    - There is only one node storing the -ROOT- table and only one node
>    storing the .META. table, if I'm doing a lot of random accesses and that my
>    dataset is VERY large, could I overload those node?

Again, I believe this is the subject of your paper, right?

Anyway, in general -ROOT- has 1 row, and that row is cached. Even if you have thousands of clients that need to update their .META. location (this would only happen at the beginning of an MR job or if .META. moves), serving from memory is fast.

Next you have .META.; again, the clients cache their region locations, so once they have them they don't need to talk to .META. until a region moves or gets split. Also, .META. isn't that big and is usually served directly from memory. The BT paper mentions that they allow .META. to be split when it grows a bit too much; this is something we've blocked for the moment in HBase.

J-D
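
The client-side caching described above can be sketched with a toy model. This is only an illustration under stated assumptions, not HBase's client code: `MetaTable`, `Client`, the server names, and the two-region layout are all hypothetical. It shows why the catalog table is rarely contacted: each region costs a client one lookup, and later requests for rows in that region are served from the client's own cache until the entry is invalidated.

```python
import bisect


class MetaTable:
    """Hypothetical stand-in for .META.: maps region start keys to servers."""

    def __init__(self, regions):
        # regions: iterable of (start_key, server); the first start key is "".
        self.regions = sorted(regions)
        self.lookups = 0  # how many times a client had to ask the catalog

    def locate(self, row):
        """Return (start_key, end_key, server) for the region holding row."""
        self.lookups += 1
        keys = [start for start, _ in self.regions]
        i = bisect.bisect_right(keys, row) - 1
        start, server = self.regions[i]
        # end_key is None for the last region (open-ended range).
        end = self.regions[i + 1][0] if i + 1 < len(self.regions) else None
        return start, end, server


class Client:
    """Hypothetical client that caches region locations after first use."""

    def __init__(self, meta):
        self.meta = meta
        self.cache = []  # list of (start_key, end_key, server)

    @staticmethod
    def _covers(region, row):
        start, end, _ = region
        return start <= row and (end is None or row < end)

    def server_for(self, row):
        for region in self.cache:
            if self._covers(region, row):
                return region[2]  # cache hit: no catalog traffic
        region = self.meta.locate(row)  # cache miss: one catalog lookup
        self.cache.append(region)
        return region[2]

    def invalidate(self, row):
        """Drop the cached region for row, e.g. after it moved or split."""
        self.cache = [r for r in self.cache if not self._covers(r, row)]


meta = MetaTable([("", "rs1"), ("m", "rs2")])
client = Client(meta)
servers = [client.server_for(r) for r in ["apple", "banana", "zebra", "mango", "apple"]]
print(servers, meta.lookups)
```

Five reads across two regions cost only two catalog lookups here; a further lookup happens only after `invalidate` is called for a row, which models a region moving or splitting.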
