Hi Jean,

It happened again today during a server restart, which involved a hadoop start followed by an hbase start. There was also an exception when the hbase master came up, while reading a file from hadoop. I am not sure if that is the problem, but I pasted those logs too.

Current state of the system: master, zookeeper, and region servers are all up, but the region servers are not connected to the master. Here are the logs:

1. Logs on hbase master and hadoop namenode.
   hbase-master.out: http://pastebin.com/6a88nRh5
   hadoop-namenode: http://pastebin.com/wHP5uQBh
2. syslog on hbase master.
   http://pastebin.com/S9KVVsSf
3. syslog on hbase regionservers. Posted one; the other is the same.
   http://pastebin.com/kR42Xt2t

I did a netstat -tna to confirm that the master is listening on 127.0.0.121:60000.

I restarted only the regionservers and they were able to connect fine.

thanks
ishwar

On Fri, Jun 11, 2010 at 12:56 PM, Jean-Daniel Cryans <[email protected]> wrote:
> You can check the general health by using the webui, it runs on the
> master node at port 60010.
>
> For the errors, the context you gave is so limited that giving any
> meaningful answer is impossible. Please post full logs on a web server
> or on pastebin.com (or your preferred code pasting site) if it fits.
>
> J-D
>
> On Fri, Jun 11, 2010 at 12:48 PM, ishwar ramani <[email protected]> wrote:
> > Hi,
> >
> > I have a hbase hadoop cluster setup. 6 days back we did a cold restart of
> > our system.
> > I recently noticed that a hbase query was timing out with
> >
> > org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying
> > to locate root region
> >
> > I looked at the master logs and none of the region servers had connected
> >
> > 2010-06-04 00:00:21,510 INFO org.apache.hadoop.hbase.master.ServerManager: 0
> > region servers, 0 dead, average load NaN
> >
> > The master had a stderr output when it started
> >
> > java.io.EOFException
> > ....
> > org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not
> > complete write to file /hbase/devLogsTable/1225469767/oldlogfile.log by
> > DFSClient_-107490689
> >
> > The regionservers have been trying to connect with the master ever since
> > with the error
> >
> > 2010-06-03 14:33:28,960 WARN
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to
> > master. Retrying. Error was: java.net.ConnectException: Connection refused
> >
> > All the region servers and master processes are running now. Except none of
> > the region servers are connected.
> >
> > My first question is how to monitor this problem. None of the logs report an
> > error. I monitor processes so they are all fine. The logs don't report any
> > error.
> > How do i check for the general health of the cluster?
> >
> > My second question is why did this happen?
> >
> > thanks
> > ishwar
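
On the monitoring question quoted above: since the processes stay up and nothing is logged at ERROR level, one rough option is to watch the master's periodic ServerManager status line and alert when it reports zero region servers. Below is a minimal sketch in Python, assuming the status line keeps the format quoted in this thread ("N region servers, M dead, average load ..."). The function names and the `min_live` threshold are hypothetical, picked only for illustration, and this is not an HBase-provided tool.

```python
import re

# Matches the ServerManager status line quoted in this thread, e.g.
# "... ServerManager: 0 region servers, 0 dead, average load NaN"
# (assumption: the line format is stable across the master's log output).
STATUS_RE = re.compile(r"ServerManager:\s+(\d+)\s+region servers,\s+(\d+)\s+dead")


def latest_region_server_count(log_lines):
    """Return (live, dead) from the last status line seen, or None if absent."""
    result = None
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match:
            result = (int(match.group(1)), int(match.group(2)))
    return result


def master_is_healthy(log_lines, min_live=1):
    """True only if the most recent status line reports at least min_live servers."""
    counts = latest_region_server_count(log_lines)
    return counts is not None and counts[0] >= min_live
```

Something like this could be run periodically over the tail of the master log, treating a False result as an alert. It only complements the web UI on port 60010 mentioned above, and a missing status line is deliberately reported as unhealthy.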
