Yes. Under the same conditions (dataset size, etc.) the issue occurred 4 out of 5 times, bringing the region server down with a YouAreDeadException. That's why I started digging into the DN and NN logs, where I could see a common pattern, as mentioned in my first mail.
- R

On Thu, Feb 27, 2014 at 11:09 AM, Jean-Marc Spaggiari <[email protected]> wrote:

> So you might want to get some metrics over time, using Ganglia or
> anything else, to track memory usage and network availability.
>
> Are you often facing this issue? Is it "easy" for you to reproduce it?
>
>
> 2014-02-27 12:05 GMT-05:00 Rohit Kelkar <[email protected]>:
>
> > Oh yes, and I forgot to add the ZK process
> > ZK = 5GB
> >
> > Total = 45GB
> >
> >
> > On Thu, Feb 27, 2014 at 11:01 AM, Rohit Kelkar <[email protected]> wrote:
> >
> > > Hi Jean-Marc,
> > >
> > > Each node has 48GB RAM
> > > To isolate and debug the RS failure issue, we have switched off all
> > > other tools. The only processes running are
> > > - DN = 4GB
> > > - RS = 6GB
> > > - TT = 4GB
> > > - num mappers available on the node = 4 * 4GB = 16GB
> > > - num reducers available on the node = 2 * 4GB = 8GB
> > > - 4 other java processes unrelated to hadoop/hbase = 512MB * 4 = 2GB
> > >
> > > Total = 40GB
> > >
> > >
> > > On Thu, Feb 27, 2014 at 10:42 AM, Jean-Marc Spaggiari <[email protected]> wrote:
> > >
> > >> 2014-02-21 13:36:27,496 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow):
> > >> {"processingtimems":41236,"call":"next(-8680499896692404689, 1), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.0.0.96:46618","starttimems":1393007746259,"queuetimems":0,"class":"HRegionServer","responsesize":6,"method":"next"}
> > >> 2014-02-21 13:41:00,272 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
> > >> 10193644ms instead of 10000000ms, this is likely due to a long garbage
> > >> collecting pause and it's usually bad, see
> > >> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > >>
> > >> Your issue is clearly this.
> > >>
> > >> As for the swap, setting swappiness does not mean that Linux will never
> > >> swap. It will try not to swap, but if it really has to, it will.
> > >>
> > >> How many GB on your server? How many for the DN, for the RS, etc.? Any TT
> > >> on them? Any other tool? If TT, how many slots? How many GB per slot?
> > >>
> > >> JM
> > >>
> > >>
> > >> 2014-02-27 11:37 GMT-05:00 Rohit Kelkar <[email protected]>:
> > >>
> > >> > Hi Jean-Marc,
> > >> >
> > >> > I have updated the RS log here (http://pastebin.com/bVDvMvrB) with
> > >> > events before 13:41:00. In the log I see a few responseTooSlow
> > >> > warnings at 13:34:00 and 13:36:00, then no activity till 13:41:00.
> > >> > At 13:41:00 there is a Sleeper warning - WARN
> > >> > org.apache.hadoop.hbase.util.Sleeper: We slept 10193644ms instead of
> > >> > 10000000ms, this is likely due to a long garbage collecting pause and
> > >> > it's usually bad, see ...
> > >> > Followed by - INFO org.apache.zookeeper.ClientCnxn: Client session
> > >> > timed out, have not heard from server in 260409ms for sessionid
> > >> > 0x34432befe5417d2, closing socket connection and attempting reconnect.
> > >> >
> > >> > Looking at some of the reasons you mentioned -
> > >> > 1. I analyzed the GC logs for this RS. In the last 10 mins before the
> > >> > RS went down, the GC times are less than 1 sec. Nothing that would
> > >> > take 260409 ms as indicated above in the RS log.
> > >> > 2. The RS node has swappiness set to 0
> > >> > 3. So I think I should investigate the possibility of network issues.
> > >> > Any pointers on where I could start?
> > >> >
> > >> > - R
> > >> >
> > >> > On Thu, Feb 27, 2014 at 10:17 AM, Jean-Marc Spaggiari <[email protected]> wrote:
> > >> >
> > >> > > Hi Rohit,
> > >> > >
> > >> > > Usually YouAreDeadException means your RegionServer is too slow. It
> > >> > > gets kicked out by Master+ZK, but then tries to join back and gets
> > >> > > informed that it has been kicked out.
> > >> > >
> > >> > > Reasons:
> > >> > > - Long Garbage Collection;
> > >> > > - Swapping;
> > >> > > - Network issues (get disconnected, then re-connected);
> > >> > > - etc.
> > >> > >
> > >> > > What do you have before 2014-02-21 13:41:00,308 in the logs?
> > >> > >
> > >> > >
> > >> > > 2014-02-27 11:13 GMT-05:00 Rohit Kelkar <[email protected]>:
> > >> > >
> > >> > > > Hi, has anybody been facing similar issues?
> > >> > > >
> > >> > > > - R
> > >> > > >
> > >> > > >
> > >> > > > On Wed, Feb 26, 2014 at 12:55 PM, Rohit Kelkar <[email protected]> wrote:
> > >> > > >
> > >> > > > > We are running hbase 0.94.2 on hadoop 0.20 append version in
> > >> > > > > production (yes, we have plans to upgrade hadoop). It's a 5 node
> > >> > > > > cluster and a 6th node running just the name node and hmaster.
> > >> > > > > I am seeing frequent RS YouAreDeadExceptions. Logs here
> > >> > > > > http://pastebin.com/44aFyYZV
> > >> > > > > The RS log shows a DFSOutputStream ResponseProcessor exception
> > >> > > > > for block blk_-6695300470410774365_837638 java.io.EOFException at
> > >> > > > > 13:41:00, followed by YouAreDeadException at the same time.
> > >> > > > > I grep'ed this block in the Datanode (see log here
> > >> > > > > http://pastebin.com/2jfwCfcK). At 13:41:00 I see an Exception in
> > >> > > > > receiveBlock for block blk_-6695300470410774365_837638
> > >> > > > > java.nio.channels.ClosedByInterruptException.
> > >> > > > > I have also attached the namenode logs around the block here
> > >> > > > > http://pastebin.com/9NE9J8s1
> > >> > > > >
> > >> > > > > Across several RS failure instances I see the following pattern -
> > >> > > > > the region server YouAreDeadException is always preceded by the
> > >> > > > > EOFException and the datanode ClosedByInterruptException
> > >> > > > >
> > >> > > > > Is the error in the movement of the block causing the region
> > >> > > > > server to report a YouAreDeadException? And of course, how do I
> > >> > > > > solve this?
> > >> > > > >
> > >> > > > > - R
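
For anyone following the numbers in this thread, here is a rough Python sketch that tallies the per-node memory figures and the pause timings quoted above. It is only arithmetic on values already posted. The variable and function names are made up for illustration, and it assumes the quoted sizes are the most each process will use (JVM overhead beyond the heap is not counted, so real headroom may be smaller).

```python
# Rough tally of figures quoted in the thread. All names here are
# illustrative; only the numbers come from the mails and the RS log.

NODE_RAM_GB = 48

# Per-node memory figures as posted (GB).
budget_gb = {
    "DN": 4,
    "RS": 6,
    "TT": 4,
    "mappers (4 x 4GB)": 4 * 4,
    "reducers (2 x 4GB)": 2 * 4,
    "other java (4 x 512MB)": 4 * 0.5,
    "ZK": 5,
}


def headroom_gb(ram_gb, budget):
    """RAM left for the OS and page cache if every process hits its quoted size."""
    return ram_gb - sum(budget.values())


# Timings from the RS log (milliseconds).
SLEEPER_EXPECTED_MS = 10_000_000
SLEEPER_ACTUAL_MS = 10_193_644
ZK_SILENCE_MS = 260_409


def overshoot_ms(actual_ms, expected_ms):
    """How much longer the Sleeper thread slept than it asked for."""
    return actual_ms - expected_ms


print("allocated GB:", sum(budget_gb.values()))
print("headroom GB:", headroom_gb(NODE_RAM_GB, budget_gb))
print("sleeper overshoot (s):", overshoot_ms(SLEEPER_ACTUAL_MS, SLEEPER_EXPECTED_MS) / 1000)
print("ZK silence (s):", ZK_SILENCE_MS / 1000)
```

With the posted figures this gives 45GB allocated out of 48GB, and a Sleeper overshoot of roughly 194 seconds next to a ZK silence of roughly 260 seconds, both far larger than the sub-second GC times reported from the GC logs. The sketch does not say which of the suspected causes (GC, swapping, network) is responsible; it only puts the quoted numbers side by side.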
