Hi,

I have an HBase/Hadoop cluster set up. Six days ago we did a cold restart of our system. I recently noticed that an HBase query was timing out with
org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying to locate root region

I looked at the master logs and found that none of the region servers had connected:

2010-06-04 00:00:21,510 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN

The master wrote this to stderr when it started:

java.io.EOFException
....
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not complete write to file /hbase/devLogsTable/1225469767/oldlogfile.log by DFSClient_-107490689

The region servers have been trying to connect to the master ever since, failing with:

2010-06-03 14:33:28,960 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying. Error was: java.net.ConnectException: Connection refused

The master and all the region server processes are running now, but none of the region servers are connected to the master.

My first question is how to monitor for this problem. I monitor the processes, and they are all fine. None of the logs report anything at ERROR level; the only signs are the INFO and WARN lines above. How do I check the general health of the cluster?

My second question is: why did this happen?

Thanks,
Ishwar
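
For reference, here is a minimal sketch of one way to watch for this condition. It assumes the master keeps logging the ServerManager summary line in exactly the format quoted above. The function names and the min_live threshold are hypothetical, not part of HBase, and this only parses log text; it does not query the cluster.

```python
import re

# Matches the ServerManager summary line as quoted from the master log.
# Assumption: the line format is stable across the HBase version in use.
SUMMARY = re.compile(
    r"ServerManager: (\d+) region servers, (\d+) dead, average load (\S+)"
)

def latest_summary(log_lines):
    """Return (live, dead, load) from the last summary line, or None if absent."""
    result = None
    for line in log_lines:
        match = SUMMARY.search(line)
        if match:
            result = (int(match.group(1)), int(match.group(2)), match.group(3))
    return result

def is_healthy(log_lines, min_live=1):
    """Hypothetical check: healthy only if the latest summary shows enough live servers."""
    summary = latest_summary(log_lines)
    return summary is not None and summary[0] >= min_live
```

Feeding it the lines of the master log (for example from a cron job) and alerting when is_healthy returns False would have caught the "0 region servers" state even though every process was still running.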
