You can check the cluster's general health through the web UI, which runs on the master node at port 60010.
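If you also want something scriptable, one option is to watch the periodic ServerManager status line in the master log (the "0 region servers, 0 dead, average load NaN" line quoted in the original message). The following is only a rough sketch: the regular expression is modeled on that single quoted line and may need adjusting for other HBase versions, and the names parse_status, last_status, is_healthy and the min_live threshold are made up for illustration.

```python
import re
from typing import Iterable, Optional, Tuple

# Modeled on the status line quoted in the original message, e.g.
# "... ServerManager: 0 region servers, 0 dead, average load NaN"
# (assumption: other HBase versions may word this line differently).
STATUS_RE = re.compile(
    r"ServerManager:\s+(\d+)\s+region servers,\s+(\d+)\s+dead"
)


def parse_status(line: str) -> Optional[Tuple[int, int]]:
    """Return (live, dead) region server counts, or None if the line does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def last_status(lines: Iterable[str]) -> Optional[Tuple[int, int]]:
    """Return the counts from the most recent status line, or None if there is none."""
    latest = None
    for line in lines:
        status = parse_status(line)
        if status is not None:
            latest = status
    return latest


def is_healthy(status: Optional[Tuple[int, int]], min_live: int = 1) -> bool:
    """Hypothetical rule: healthy when at least min_live region servers are reported."""
    return status is not None and status[0] >= min_live
```

To use it, open the master log (its location depends on your installation), pass the file object to last_status, and alert whenever is_healthy returns False. With the line from the original message the result would be (0, 0), which this rule flags as unhealthy even though every process is still running.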
For the errors, the context you gave is so limited that giving any meaningful answer is impossible. Please post full logs on a web server or on pastebin.com (or your preferred code pasting site) if it fits.

J-D

On Fri, Jun 11, 2010 at 12:48 PM, ishwar ramani <[email protected]> wrote:
> Hi,
>
> I have a hbase hadoop cluster setup. 6 days back we did a cold restart of
> our system.
> I recently noticed that a hbase query was timing out with
>
> org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying
> to locate root region
>
>
> I looked at the master logs and none of the region servers had connected
>
> 2010-06-04 00:00:21,510 INFO org.apache.hadoop.hbase.master.ServerManager: 0
> region servers, 0 dead, average load NaN
>
>
> The master had a stderr output when it started
>
> java.io.EOFException
> ....
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not
> complete write to file /hbase/devLogsTable/1225469767/oldlogfile.log by
> DFSClient_-107490689
>
> The regionservers have been trying to connect with the master ever since
> with the error
>
> 2010-06-03 14:33:28,960 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to
> master. Retrying. Error was: java.net.ConnectException: Connection refused
>
>
> All the region servers and master processes are running now. Except none of
> the region servers are connected.
>
>
> My first question is how to monitor this problem. None of the logs report an
> error. I monitor processes so they are all fine. The logs don't report any
> error.
> How do i check for the general health of the cluster?
>
>
> My second question is why did this happen?
>
> thanks
> ishwar
>
