Hi Jean,

It happened again today during a server restart. This involved a hadoop
start followed by an hbase start.
There was also an exception when the hbase master came up, while reading a
file from hadoop. Not sure if that is the problem.
Pasted those logs too.


Current state of the system: master, zookeeper, region servers are all up.
But region servers are not connected to master.
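
To catch this state automatically, here is a minimal sketch that scans the
master log for the ServerManager summary line (quoted further down in this
thread) and returns the live and dead region server counts. The function
name is hypothetical, and the line format is assumed from that single
sample, so the pattern may need adjusting for other HBase versions.

```python
import re

# Pattern assumed from the one sample line in this thread:
# "... ServerManager: 0 region servers, 0 dead, average load NaN"
SUMMARY = re.compile(r"ServerManager: (\d+) region servers, (\d+) dead")


def region_server_counts(log_text):
    """Return (live, dead) from the last summary line in log_text, or None."""
    matches = SUMMARY.findall(log_text)
    if not matches:
        return None  # no summary line found in this chunk of the log
    live, dead = matches[-1]
    return int(live), int(dead)
```

A monitoring job could alert whenever the live count is 0 while the region
server processes are known to be running.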

Here are the logs ....


1. logs on hbase master and hadoop namenode.
hbase-master.out: http://pastebin.com/6a88nRh5
hadoop-namenode: http://pastebin.com/wHP5uQBh

2.  syslog on hbase master.
http://pastebin.com/S9KVVsSf

3. syslog on hbase regionservers. Posted one; the other is the same.
http://pastebin.com/kR42Xt2t


I did a netstat -tna to confirm that the master is listening on
127.0.0.121:60000
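
netstat only shows the listener on the master host itself. A minimal sketch
of the complementary check, run from a region server host, is below: it
tries a plain TCP connect to the master port. The function name is
hypothetical and the host name in the usage comment is a placeholder, not a
machine from this cluster.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


# Example usage (placeholder host name; 60000 is the port seen in netstat):
# print(port_is_open("hbase-master.example.com", 60000))
```

If this returns False from a region server while netstat shows the listener
on the master, the address the master is bound to is worth a closer look.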

I did a restart of the regionservers only and they are able to connect fine.


thanks
ishwar


On Fri, Jun 11, 2010 at 12:56 PM, Jean-Daniel Cryans <[email protected]> wrote:

> You can check the general health by using the webui, it runs on the
> master node at port 60010.
>
> For the errors, the context you gave is so limited that giving any
> meaningful answer is impossible. Please post full logs on a web server
> or on pastebin.com (or your preferred code pasting site) if it fits.
>
> J-D
>
> On Fri, Jun 11, 2010 at 12:48 PM, ishwar ramani <[email protected]>
> wrote:
> > Hi,
> >
> > I have a hbase hadoop cluster setup. 6 days back we did a cold restart of
> > our system.
> > I recently noticed that a hbase query was timing out with
> >
> > org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying
> > to locate root region
> >
> >
> > I looked at the master logs and none of the region servers had connected
> >
> > 2010-06-04 00:00:21,510 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
> >
> >
> > The master had a stderr output when it started
> >
> > java.io.EOFException
> > ....
> > org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not
> > complete write to file /hbase/devLogsTable/1225469767/oldlogfile.log by
> > DFSClient_-107490689
> >
> > The regionservers have been trying to connect with the master ever since
> > with the error
> >
> > 2010-06-03 14:33:28,960 WARN
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to
> > master. Retrying. Error was: java.net.ConnectException: Connection refused
> >
> >
> > All the region server and master processes are running now, except that
> > none of the region servers are connected.
> >
> >
> > My first question is how to monitor this problem. None of the logs report
> > an error, and the processes I monitor are all fine.
> > How do I check the general health of the cluster?
> >
> >
> > My second question is why did this happen?
> >
> > thanks
> > ishwar
> >
>
