Re: region server dead and datanode block movement error

Rohit Kelkar Thu, 27 Feb 2014 09:14:02 -0800

Yes. Under the same conditions (dataset size, etc.) the issue occurred 4 out
of 5 times, and it brought the region server down with a YouAreDeadException.
That's why I started digging into the DN and NN logs, where I could see the
common pattern mentioned in my first mail.


- R


On Thu, Feb 27, 2014 at 11:09 AM, Jean-Marc Spaggiari <
[email protected]> wrote:

> So you might want to collect some metrics over time, using Ganglia or
> anything else, to track memory usage and network availability.
>
> Are you often facing this issue? Is it "easy" for you to reproduce it?
>
>
> 2014-02-27 12:05 GMT-05:00 Rohit Kelkar <[email protected]>:
>
> > Oh yes, and I forgot to add the ZK process:
> > ZK = 5GB
> >
> > Total = 45GB
> >
> >
> > On Thu, Feb 27, 2014 at 11:01 AM, Rohit Kelkar <[email protected]> wrote:
> >
> > > Hi Jean-Marc,
> > >
> > > Each node has 48GB RAM
> > > To isolate and debug the RS failure issue, we have switched off all other
> > > tools. The only processes running are
> > > - DN = 4GB
> > > - RS = 6GB
> > > - TT = 4GB
> > > - num mappers available on the node = 4 * 4GB = 16GB
> > > - num reducers available on the node = 2 * 4GB = 8GB
> > > - 4 other java processes unrelated to hadoop/hbase = 512MB * 4 = 2GB
> > >
> > > Total = 40GB
> > >
> > >
> > > On Thu, Feb 27, 2014 at 10:42 AM, Jean-Marc Spaggiari <
> > > [email protected]> wrote:
> > >
> > >> 2014-02-21 13:36:27,496 WARN org.apache.hadoop.ipc.HBaseServer:
> > >> (responseTooSlow): {"processingtimems":41236,"call":"next(-8680499896692404689, 1), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.0.0.96:46618","starttimems":1393007746259,"queuetimems":0,"class":"HRegionServer","responsesize":6,"method":"next"}
> > >> 2014-02-21 13:41:00,272 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
> > >> 10193644ms instead of 10000000ms, this is likely due to a long garbage
> > >> collecting pause and it's usually bad, see
> > >> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > >>
> > >> Your issue is clearly this.
> > >>
> > >> As for swap, setting swappiness does not guarantee that Linux will not
> > >> swap. It will try not to swap, but if it really has to, it will.
> > >>
> > >> How many GB on your server? How many for the DN, for the RS, etc.? Any TT on
> > >> them? Any other tools? If TT, how many slots? How many GB per slot?
> > >>
> > >> JM
> > >>
> > >>
> > >> 2014-02-27 11:37 GMT-05:00 Rohit Kelkar <[email protected]>:
> > >>
> > >> > Hi Jean-Marc,
> > >> >
> > >> > I have updated the RS log here (http://pastebin.com/bVDvMvrB) with events
> > >> > before 13:41:00. In the log I see a few responseTooSlow warnings at
> > >> > 13:34:00, 13:36:00. Then no activity till 13:41:00.
> > >> > At 13:41:00 there is a Sleeper warning - WARN
> > >> > org.apache.hadoop.hbase.util.Sleeper: We slept 10193644ms instead of
> > >> > 10000000ms, this is likely due to a long garbage collecting pause and it's
> > >> > usually bad, see ...
> > >> > Followed by - INFO org.apache.zookeeper.ClientCnxn: Client session timed
> > >> > out, have not heard from server in 260409ms for sessionid
> > >> > 0x34432befe5417d2, closing socket connection and attempting reconnect.
> > >> >
> > >> > Looking at some of the reasons you mentioned -
> > >> > 1. I analyzed the GC logs for this RS. In the last 10 mins before the RS
> > >> > went down, the GC times are less than 1 sec. Nothing that would take
> > >> > 260409 ms as indicated above in the RS log.
> > >> > 2. The RS node has swappiness set to 0.
> > >> > 3. So I think I should investigate the possibility of network issues. Any
> > >> > pointers on where I could start?
> > >> >
> > >> > - R
> > >> >
> > >> > On Thu, Feb 27, 2014 at 10:17 AM, Jean-Marc Spaggiari <
> > >> > [email protected]> wrote:
> > >> >
> > >> > > Hi Rohit,
> > >> > >
> > >> > > Usually a YouAreDeadException means your RegionServer is too slow. It
> > >> > > gets kicked out by Master+ZK, then tries to rejoin and is informed that
> > >> > > it has been kicked out.
> > >> > >
> > >> > > Reasons:
> > >> > > - Long Garbage Collection;
> > >> > > - Swapping;
> > >> > > - Network issues (get disconnected, then re-connected);
> > >> > > - etc.
> > >> > >
> > >> > > What do you have before 2014-02-21 13:41:00,308 in the logs?
> > >> > >
> > >> > >
> > >> > > 2014-02-27 11:13 GMT-05:00 Rohit Kelkar <[email protected]>:
> > >> > >
> > >> > > > Hi, has anybody been facing similar issues?
> > >> > > >
> > >> > > > - R
> > >> > > >
> > >> > > >
> > >> > > > On Wed, Feb 26, 2014 at 12:55 PM, Rohit Kelkar <[email protected]> wrote:
> > >> > > >
> > >> > > > > We are running hbase 0.94.2 on hadoop 0.20 append version in production
> > >> > > > > (yes, we have plans to upgrade hadoop). It's a 5-node cluster plus a 6th
> > >> > > > > node running just the name node and hmaster.
> > >> > > > > I am seeing frequent RS YouAreDeadExceptions. Logs here
> > >> > > > > http://pastebin.com/44aFyYZV
> > >> > > > > The RS log shows a DFSOutputStream ResponseProcessor exception for block
> > >> > > > > blk_-6695300470410774365_837638 java.io.EOFException at 13:41:00 followed
> > >> > > > > by YouAreDeadException at the same time.
> > >> > > > > I grep'ed this block in the Datanode (see log here
> > >> > > > > http://pastebin.com/2jfwCfcK). At 13:41:00 I see an Exception in
> > >> > > > > receiveBlock for block blk_-6695300470410774365_837638
> > >> > > > > java.nio.channels.ClosedByInterruptException.
> > >> > > > > I have also attached the namenode logs around the block here
> > >> > > > > http://pastebin.com/9NE9J8s1
> > >> > > > >
> > >> > > > > Across several RS failure instances I see the following pattern - the
> > >> > > > > region server YouAreDeadException is always preceded by the EOFException
> > >> > > > > and the datanode ClosedByInterruptException.
> > >> > > > >
> > >> > > > > Is the error in the movement of the block causing the region server to
> > >> > > > > report a YouAreDeadException? And of course, how do I solve this?
> > >> > > > >
> > >> > > > > - R
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
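
For anyone following the numbers in the thread, here is a rough sketch that simply re-does the arithmetic quoted above: the per-node memory budget (48GB RAM against the listed process sizes, including ZK) and the overshoot in the Sleeper warning. It assumes the listed sizes are configured maximum heaps, which the thread does not actually state; the variable and function names are made up for illustration. Since a JVM's resident memory can exceed its configured heap, the computed headroom is best read as an upper bound on what is left for the OS.

```python
# Per-node figures as quoted in the thread (all in GB).
# Assumption: these are configured maximum heap sizes; the thread does not say.
NODE_RAM_GB = 48

ALLOCATIONS_GB = {
    "DN": 4,
    "RS": 6,
    "TT": 4,
    "mappers (4 x 4GB)": 4 * 4,
    "reducers (2 x 4GB)": 2 * 4,
    "other java processes (4 x 512MB)": 4 * 0.5,
    "ZK": 5,
}

# Values copied from the quoted Sleeper and ZooKeeper log lines (milliseconds).
SLEPT_MS = 10193644
EXPECTED_SLEEP_MS = 10000000
ZK_SILENCE_MS = 260409


def committed_gb(allocations):
    """Total memory committed to the listed processes."""
    return sum(allocations.values())


def headroom_gb(ram_gb, allocations):
    """RAM left over for the OS, page cache and JVM overhead."""
    return ram_gb - committed_gb(allocations)


def overshoot_ms(slept_ms, expected_ms):
    """How much longer the Sleeper thread slept than it asked for."""
    return slept_ms - expected_ms


if __name__ == "__main__":
    total = committed_gb(ALLOCATIONS_GB)
    print(f"committed: {total:g}GB of {NODE_RAM_GB}GB")
    print(f"headroom:  {headroom_gb(NODE_RAM_GB, ALLOCATIONS_GB):g}GB")

    extra = overshoot_ms(SLEPT_MS, EXPECTED_SLEEP_MS)
    print(f"Sleeper overshoot: {extra}ms (~{extra / 1000:.0f}s)")
    print(f"ZK silence:        {ZK_SILENCE_MS}ms (~{ZK_SILENCE_MS / 1000:.0f}s)")
```

With the figures from the thread this gives 45GB committed and 3GB of headroom, and a process stall on the order of minutes, far longer than the sub-second GC times reported from the GC logs.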
