Hi Jean-Marc,

Each node has 48GB RAM. To isolate and debug the RS failure issue, we have
switched off all other tools. The only processes running are:
- DN = 4GB
- RS = 6GB
- TT = 4GB
- num mappers available on the node = 4 * 4GB = 16GB
- num reducers available on the node = 2 * 4GB = 8GB
- 4 other java processes unrelated to hadoop/hbase = 512MB * 4 = 2GB
Total = 40GB

On Thu, Feb 27, 2014 at 10:42 AM, Jean-Marc Spaggiari <
[email protected]> wrote:

> 2014-02-21 13:36:27,496 WARN org.apache.hadoop.ipc.HBaseServer:
> (responseTooSlow):
> {"processingtimems":41236,"call":"next(-8680499896692404689, 1), rpc
> version=1, client version=29, methodsFingerPrint=54742778","client":"
> 10.0.0.96:46618
> ","starttimems":1393007746259,"queuetimems":0,"class":"HRegionServer","responsesize":6,"method":"next"}
> 2014-02-21 13:41:00,272 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
> 10193644ms instead of 10000000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>
> Your issue is clearly this.
>
> For the swap, it's not because you set swappiness that Linux will not
> swap. It will try not to swap, but if it really has to, it will.
>
> How many GB on your server? How many for the DN, for the RS, etc.? Any TT
> on them? Any other tool? If TT, how many slots? How many GB per slot?
>
> JM
>
>
> 2014-02-27 11:37 GMT-05:00 Rohit Kelkar <[email protected]>:
>
> > Hi Jean-Marc,
> >
> > I have updated the RS log here (http://pastebin.com/bVDvMvrB) with
> > events before 13:41:00. In the log I see a few responseTooSlow warnings
> > at 13:34:00 and 13:36:00, then no activity till 13:41:00.
> > At 13:41:00 there is a Sleeper warning - WARN
> > org.apache.hadoop.hbase.util.Sleeper: We slept 10193644ms instead of
> > 10000000ms, this is likely due to a long garbage collecting pause and
> > it's usually bad, see ...
> > Followed by - INFO org.apache.zookeeper.ClientCnxn: Client session timed
> > out, have not heard from server in 260409ms for sessionid
> > 0x34432befe5417d2, closing socket connection and attempting reconnect.
> >
> > Looking at some of the reasons you mentioned -
> > 1. I analyzed the GC logs for this RS. In the last 10 mins before the RS
> > went down, the GC times are less than 1 sec.
> > Nothing that would take > 260409
> > ms as indicated above in the RS log.
> > 2. The RS node has swappiness set to 0.
> > 3. So I think I should investigate the possibility of network issues.
> > Any pointers where I could start?
> >
> > - R
> >
> > On Thu, Feb 27, 2014 at 10:17 AM, Jean-Marc Spaggiari <
> > [email protected]> wrote:
> >
> > > Hi Rohit,
> > >
> > > Usually YouAreDeadException means your RegionServer is too slow. It
> > > gets kicked out by Master+ZK but then tries to join back and gets
> > > informed it has been kicked out.
> > >
> > > Reasons:
> > > - Long garbage collection;
> > > - Swapping;
> > > - Network issues (gets disconnected, then re-connected);
> > > - etc.
> > >
> > > What do you have before 2014-02-21 13:41:00,308 in the logs?
> > >
> > >
> > > 2014-02-27 11:13 GMT-05:00 Rohit Kelkar <[email protected]>:
> > >
> > > > Hi, has anybody been facing similar issues?
> > > >
> > > > - R
> > > >
> > > >
> > > > On Wed, Feb 26, 2014 at 12:55 PM, Rohit Kelkar <
> > > > [email protected]> wrote:
> > > >
> > > > > We are running hbase 0.94.2 on hadoop 0.20 append version in
> > > > > production (yes, we have plans to upgrade hadoop). It's a 5 node
> > > > > cluster and a 6th node running just the name node and hmaster.
> > > > > I am seeing frequent RS YouAreDeadExceptions. Logs here:
> > > > > http://pastebin.com/44aFyYZV
> > > > > The RS log shows a DFSOutputStream ResponseProcessor exception for
> > > > > block blk_-6695300470410774365_837638 java.io.EOFException at
> > > > > 13:41:00, followed by YouAreDeadException at the same time.
> > > > > I grep'ed this block in the Datanode (see log here:
> > > > > http://pastebin.com/2jfwCfcK). At 13:41:00 I see an Exception in
> > > > > receiveBlock for block blk_-6695300470410774365_837638
> > > > > java.nio.channels.ClosedByInterruptException.
> > > > > I have also attached the namenode logs around the block here:
> > > > > http://pastebin.com/9NE9J8s1
> > > > >
> > > > > Across several RS failure instances I see the following pattern -
> > > > > the region server YouAreDeadException is always preceded by the
> > > > > EOFException and datanode ClosedByInterruptException.
> > > > >
> > > > > Is the error in the movement of the block causing the region
> > > > > server to report a YouAreDeadException? And of course, how do I
> > > > > solve this?
> > > > >
> > > > > - R
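[Editor's note] The GC-log check described in point 1 of the thread can be scripted. A long stop-the-world pause of the kind that expires a ZK session shows up as a large wall-clock time in the HotSpot GC log. Below is a minimal sketch, assuming a `-verbose:gc -Xloggc:` style log containing `real=N.NN secs` entries; the log path `gc.log` and the 10-second threshold are illustrative assumptions, not values from the thread.

```shell
# Sketch: print wall-clock GC pauses longer than a threshold (seconds).
# Assumes HotSpot GC log lines containing "real=N.NN secs";
# adjust the pattern if your collector logs a different format.
GC_LOG="${1:-gc.log}"
THRESHOLD="${2:-10}"

grep -o 'real=[0-9.]* secs' "$GC_LOG" \
  | awk -F'[= ]' -v t="$THRESHOLD" '$2 + 0 > t { print $2 }'
```

If this prints nothing around the failure window, that supports the conclusion in the thread that the 260409ms gap was not a GC pause.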

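[Editor's note] The memory budget at the top of the thread can be sanity-checked: with everything else switched off, the node commits 40GB of its 48GB, leaving 8GB of headroom for the OS and page cache. A quick sketch using only the figures quoted above:

```shell
# Recompute the per-node memory commitment from the figures above (GB).
DN=4; RS=6; TT=4
MAP_SLOTS=4; REDUCE_SLOTS=2; SLOT_GB=4
OTHER=2    # 4 unrelated java processes * 512MB
NODE=48

TOTAL=$(( DN + RS + TT + (MAP_SLOTS + REDUCE_SLOTS) * SLOT_GB + OTHER ))
echo "committed: ${TOTAL}GB of ${NODE}GB, headroom: $(( NODE - TOTAL ))GB"
# → committed: 40GB of 48GB, headroom: 8GB
```

Note this counts slot capacity, not actual heap use; if every slot is occupied at once, 8GB of headroom on a busy DataNode/RegionServer node can still leave the kernel tempted to swap, which is consistent with JM's caution that swappiness=0 does not forbid swapping.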