Hi,

I am not sure whether this has been observed before, but I have been running into issues with a hung master.

This is on a 700-node cluster with HBase 0.90.0 and Hadoop 0.20.x. Every now and then, the master stops responding (any request throws MasterNotRunningException) and stays unresponsive. The master process is still alive. The last line in the master log says "Waiting for split writer threads to finish", logged shortly after startup (log messages shown below). There are no further log messages after that, even after a couple of hours. Running jstack -F on the process throws a DebuggerException on every thread and reports no deadlocks.

Is there any other way to monitor the master? I didn't observe this in my small-scale (40-node) tests, where HBase 0.90 worked just fine.

Cheers
Vidhya

2011-01-28 07:35:49,866 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110270.yst.yahoo.net,60020,1296199618314 belongs to an existing region server
2011-01-28 07:35:49,866 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615 doesn't belong to a known region server, splitting
2011-01-28 07:35:49,867 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) in hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615
2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-0,5,main]: starting
2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-1,5,main]: starting
2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 1: hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266, length=0
2011-01-28 07:35:49,867 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-2,5,main]: starting
2011-01-28 07:35:49,867 WARN org.apache.hadoop.hbase.util.FSUtils: Running on HDFS without append enabled may result in data loss
2011-01-28 07:35:49,867 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266 might be still open, length is 0
2011-01-28 07:35:49,869 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Could not open hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266 for reading. File is empty
java.io.EOFException
2011-01-28 07:35:49,875 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Archived processed log hdfs://b3110120.yst.yahoo.net:4600/hbase/.logs/b3110271.yst.yahoo.net,60020,1296183218615/b3110271.yst.yahoo.net%3A60020.1296183219266 to hdfs://b3110120.yst.yahoo.net:4600/hbase/.oldlogs/b3110271.yst.yahoo.net%3A60020.1296183219266
2011-01-28 07:35:49,877 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split writer threads to finish
