Last night a regionserver in my cluster stopped responding in a timely manner for about 20 minutes. I know that stop-the-world GC can cause this type of behavior, but 20 minutes seems excessive.
The server is a 2-core VM with 16GB of RAM (the HBase max heap is 12GB). We are using the latest Java 7 from Oracle. HDFS is provided by an Isilon cluster.

The server workload is read/write: the writing process reads all rows it is about to write, updates them if they exist, and then writes all the rows (replacing the ones that were updated).

The last messages before the pause were regarding an HLog roll:

DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: HLog roll requested
INFO org.apache.hadoop.hbase.util.FSUtils: FileSystem doesn't support getDefaultReplication
INFO org.apache.hadoop.hbase.util.FSUtils: FileSystem doesn't support getDefaultBlockSize

During the next 20 minutes there were a handful of sporadic LruBlockCache stats messages but nothing else. After 20 minutes, normal operation resumed.

Is a 20-minute GC pause expected given the operational load and machine specs? Could periodic log messages still appear during a GC pause? If it wasn't a GC pause, what else could it be?

--Tom
