Here is an interesting anecdote. I had regionservers running on each node of an 8-node Hadoop cluster. Yesterday morning, I ran a series of MR jobs in which the last job does batched inserts into a production MySQL server. All the other MR jobs run 3 mappers and 3 reducers per node. The db job runs 3 mappers and 1 reducer on each node. The regionservers stayed up until the db job's reducers started running; then the heartbeats to ZooKeeper were lost and they all went down.
It looks to be network or swap related. I will dig into the Cacti results more.
