Re: region server dead and datanode block movement error

Rohit Kelkar Thu, 27 Feb 2014 08:39:11 -0800

Hi Jean-Marc,

I have updated the RS log here (http://pastebin.com/bVDvMvrB) with events
before 13:41:00. In the log I see a few responseTooSlow warnings at
13:34:00 and 13:36:00, then no activity until 13:41:00.
At 13:41:00 there is a Sleeper warning - WARN
org.apache.hadoop.hbase.util.Sleeper: We slept 10193644ms instead of
10000000ms, this is likely due to a long garbage collecting pause and it's
usually bad, see ...
Followed by - INFO org.apache.zookeeper.ClientCnxn: Client session timed
out, have not heard from server in 260409ms for sessionid
0x34432befe5417d2, closing socket connection and attempting reconnect.
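
To put numbers on the two messages above, here is a minimal Python sketch that pulls the stall durations out of RS log lines. The function name and both regexes are illustrative assumptions built only from the two log lines quoted above, not from HBase or ZooKeeper source, so they may need adjusting for other versions.

```python
import re

# Patterns are assumptions derived from the two log lines quoted above.
SLEEPER_RE = re.compile(r"We slept (\d+)ms instead of (\d+)ms")
ZK_TIMEOUT_RE = re.compile(r"have not heard from server in (\d+)ms")


def find_stalls(lines):
    """Return (kind, stall_ms) tuples for Sleeper overshoots and ZK silences."""
    stalls = []
    for line in lines:
        m = SLEEPER_RE.search(line)
        if m:
            slept, expected = int(m.group(1)), int(m.group(2))
            # Overshoot = how much longer the thread slept than it asked to.
            stalls.append(("sleeper_overshoot", slept - expected))
            continue
        m = ZK_TIMEOUT_RE.search(line)
        if m:
            stalls.append(("zk_silence", int(m.group(1))))
    return stalls
```

Applied to the two lines quoted above (each unwrapped onto a single line), this reports a Sleeper overshoot of 193644 ms and a ZooKeeper silence of 260409 ms.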


Looking at some of the reasons you mentioned -
1. I analyzed the GC logs for this RS. In the last 10 minutes before the RS
went down, every GC pause was under 1 second, nowhere near the 260409 ms
gap reported in the RS log above.
2. The RS node has swappiness set to 0.
3. So I think I should investigate the possibility of network issues. Any
pointers on where I could start?
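
For point 1, the GC check can be sketched roughly as below. This assumes the GC log was written with -XX:+PrintGCDetails, so that each collection ends with a `[Times: user=... sys=..., real=... secs]` entry. The function name and the 1-second threshold are illustrative choices, not part of any HBase tooling.

```python
import re

# Assumes HotSpot -XX:+PrintGCDetails output, where each collection
# reports its wall-clock duration as "real=<seconds> secs".
REAL_RE = re.compile(r"real=(\d+(?:\.\d+)?) secs")


def long_pauses(gc_lines, threshold_secs=1.0):
    """Return wall-clock GC pauses (in seconds) at or above threshold_secs."""
    pauses = []
    for line in gc_lines:
        for m in REAL_RE.finditer(line):
            secs = float(m.group(1))
            if secs >= threshold_secs:
                pauses.append(secs)
    return pauses
```

An empty result for the window before 13:41:00 would be consistent with point 1: no single GC pause comes close to the 260409 ms gap.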

- R

On Thu, Feb 27, 2014 at 10:17 AM, Jean-Marc Spaggiari <
[email protected]> wrote:

> Hi Rohit,
>
> Usually YouAreDeadException is when your RegionServer is too slow. It gets
> kicked out by Master+ZK, but then tries to rejoin and is told it has
> been kicked out.
>
> Reasons:
> - Long Garbage Collection;
> - Swapping;
> - Network issues (get disconnected, then re-connected);
> - etc.
>
> What do you have before 2014-02-21 13:41:00,308 in the logs?
>
>
> 2014-02-27 11:13 GMT-05:00 Rohit Kelkar <[email protected]>:
>
> > Hi, has anybody been facing similar issues?
> >
> > - R
> >
> >
> > On Wed, Feb 26, 2014 at 12:55 PM, Rohit Kelkar <[email protected]> wrote:
> >
> > > We are running hbase 0.94.2 on hadoop 0.20 append version in production
> > > (yes we have plans to upgrade hadoop). It's a 5-node cluster plus a 6th node
> > > running just the name node and hmaster.
> > > I am seeing frequent RS YouAreDeadExceptions. Logs here
> > > http://pastebin.com/44aFyYZV
> > > The RS log shows a DFSOutputStream ResponseProcessor exception for block
> > > blk_-6695300470410774365_837638 java.io.EOFException at 13:41:00, followed
> > > by YouAreDeadException at the same time.
> > > I grep'ed this block in the Datanode (see log here
> > > http://pastebin.com/2jfwCfcK). At 13:41:00 I see an Exception in
> > > receiveBlock for block blk_-6695300470410774365_837638
> > > java.nio.channels.ClosedByInterruptException.
> > > I have also attached the namenode logs around the block here
> > > http://pastebin.com/9NE9J8s1
> > >
> > > Across several RS failure instances I see the following pattern - the
> > > region server YouAreDeadException is always preceded by the EOFException
> > > and the datanode ClosedByInterruptException.
> > >
> > > Is the error in the movement of the block causing the region server to
> > > report a YouAreDeadException? And of course, how do I solve this?
> > >
> > > - R
> > >
> >
>
