Hi, we are running a 9 node Hadoop/Hbase cluster on EC2. We expect the number of nodes to grow multi fold in a couple of months.
Of these 9 instances, three are m1.medium and they act as Hadoop Namenode, Hbase master and the secondary namenode. The remaining 6 nodes are all mi.large instances where we run datanodes,tasktrackers and the regionservers. The large instances have 7.5 GB of total memory. There are couple of issues that we are facing, and we feel that this architecture is not going to scale for us. 1) The uptime. We have been struggling to keep the cluster up and running. Sometimes the region-servers fail due to overload. Sometimes the GC pause forces the zookeeper to declare them dead. Sometime the regionserver get killed due to OutOfMemory. The map reduce always brings atleast 2-3 region-servers down. 2) We have 4G memory allocated for Region servers and we run data nodes, task trackers on them as well. We have observed that our region servers are vulnerable when we run MR jobs in our cluster. Is this some sign of competing resources (Memory) on the region servers or is this(having RS and data nodes/task trackers) generally not advisable? 3) So far our data size is close to 3TB(including replication) split over 6 region servers. An example of regionserver stat- numberOfOnlineRegions=231, numberOfStores=2055, numberOfStorefiles=1180, storefileIndexSizeMB=15, rootIndexSizeKB=16256. We would like to understand what is an ideal load for a region server - both in terms of rows and the actual data that it can serve? This is because, I would like understand if we are over pounding our system which it is not expected to handle. 4). We have always observed that, our Region Servers go down usually just after a long GC pause(DURATION), because, it prevented the RSs from acking to ZK for the session maintenance, or, it is usually a OOM. 5)The cost. The large instances and the 3 factor replication are expensive. We are not sure if horizontally scaling with large instances is the way to go. It'd be very useful if we can understand some already known limitations of the system, so that, we wont end up spending time in the wrong direction. Though we have been able to fix a lot of issues when they happen, we are looking for a more stable architecture. We feel that the cluster setup we have is wrong and we need confirmation and suggestions from the community. we will really appreciate your suggestions/pointers and any ideas would be very helpful. Also, we are hoping you can just tell us about your architecture which is not giving you nightmares. regards, Prem
