I've dealt with dozens of spontaneous region server shutdowns in recent weeks. (We call them Region Server Suicides.)
The files problem is where the OS (i.e. Linux) limits the number of files a user can open at one time. A common default of 1024 isn't enough for HBase. Based purely on empirical evidence, if your problem were the number of files for the hbase user, you would have failed earlier than 100 million rows. You can also run into the same problem for the hadoop user, but the number-of-files issue shows up earlier for the hbase user. It is probably best to raise both limits at the same time.

While not having enough files is definitely a gotcha, there are a few other things to look out for as well.

Debugging:

One misleading aspect of tracking down this kind of problem is that most of the messages that show up when you experience it are actually a side effect of something that happened earlier. You've probably realized this, since you've searched over a long period of time in your logs.

Other things to consider:

* The most common reason I've had for Region Server Suicide is ZooKeeper. The region server thinks ZooKeeper is down. I thought this had to do with heavy load, but it also happens for me even when there is nothing running. I haven't been able to find a quantifiable cause. This is just a weakness that exists in the HBase-ZooKeeper dependency. Higher loads exacerbate the problem, but are not required for a Region Server Suicide event to occur.

* Another reason is the HDFS dependency. If a file is unavailable for any reason, even temporarily, HBase handles the situation with Region Server Suicide.

HBase is a powerful tool that allows us to do more with less, but it is currently somewhat brittle with respect to its dependencies. Suicide is the standard response to any hiccup with them. Hopefully the response will become less "final" as HBase becomes more robust. Perhaps if there were a setting that controlled whether or not a region server is allowed to commit suicide, some of us would feel more comfortable with the idea.
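As a rough illustration of the files problem, here is a minimal Python sketch that reports the open-file limit for whichever user runs it (so it would need to be run as the hbase user and again as the hadoop user). It relies on the standard `resource` module, which is Unix-only. The threshold `MIN_RECOMMENDED_NOFILE` is my own placeholder, not a documented HBase requirement; all I can say from experience is that 1024 is too low.

```python
import resource

# Hypothetical threshold chosen for illustration only. The only firm claim
# here is that the common default of 1024 is not enough for HBase.
MIN_RECOMMENDED_NOFILE = 32768


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_limit_too_low(soft, minimum=MIN_RECOMMENDED_NOFILE):
    """Return True if the soft limit is finite and below the chosen minimum."""
    if soft == resource.RLIM_INFINITY:
        return False
    return soft < minimum


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"open files: soft={soft} hard={hard}")
    if is_limit_too_low(soft):
        print("soft limit looks low; consider raising it for the hbase and hadoop users")
```

The limits are inherited by child processes, so the number that matters is the one in effect for the shell or init script that actually launches the region server, not the one in your own login session.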
In the meantime, you can try to work around any of these issues by using bigger hardware than you would otherwise think is needed and not letting the load get very high. For example, I tend to have these kinds of problems much less often when the load on any individual machine never goes above the number of cores. I also recommend sticking to the latest version available.

FYI,
Matthew

On Sep 13, 2010, at 7:20 PM, Jean-Daniel Cryans wrote:

> Can we see the actual line of when it died, with a lot of context and
> please in a pastebin.com
>
> Also, most of the time users get this kind of error because they
> didn't configure HBase and Hadoop properly, mostly the last
> requirement:
> http://hbase.apache.org/docs/r0.20.6/api/overview-summary.html#requirements
>
> J-D
>
> On Mon, Sep 13, 2010 at 7:08 PM, ZhouShuaifeng 00100568
> <[email protected]> wrote:
>> Hi All,
>>
>> I encounted some problem when doing putting data test on hbase. Please help.
>> Thanks a lot.
>>
>> After putting about millions of rows, the 2 region servers of 3 were stopped.
>> Server 1 stopped when putting about 50 million rows.
>> Server 2 stopped when putting about 100 million rows.
>>
>> Some exceptions are throwed.
>> The client exception info is below:
>> org.apache.hadoop.hbase.client.NoServerForRegionException: No server address
>> listed in .META. for region xxx.
>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:833)
>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:677)
>> at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfPuts(HConnectionManager.java:1419)
>> at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:664)
>> at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:549)
>> at org.apache.hadoop.hbase.client.HTable.put(HTable.java:535)
>>
>> The server exception is below:
>> org.apache.hadoop.hbase.NotServingRegionException: xxx. is closed
>> at org.apache.hadoop.hbase.regionserver.HRegion.internalObtainRowLock(HRegion.java:2122)
>> at org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:2211)
>> at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchPut(HRegion.java:1493)
>> at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1447)
>> at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1703)
>> at org.apache.hadoop.hbase.regionserver.HRegionServer.multiPut(HRegionServer.java:2361)
>> at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> at java.lang.reflect.Method.invoke(Method.java:597)
>> at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
>> at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
>>
>> region server 1 stoped at about 13:59
>> 2010-09-12 13:59:01,017 INFO org.apache.hadoop.hbase.master.ServerManager: 2
>> region servers, 1 dead, average load 99.5[md-prod04,60020,1284169880500]
>>
>> the last 2 logs of this regionserver before it stoped is:
>> 2010-09-12 13:57:46,170 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
>> Caches flushed, doing commit now (which includes update scanners)
>> 2010-09-12 13:57:46,172 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> Finished memstore flush of ~21.0m for region
>> percontent_hr,2000-01-03#http#001#url18#s#states44#0,1284322397591.8a557b61c9eb4b117368051b98e8d1d1.
>> in 290ms, sequence id=155719842, compaction requested=false
>>
>> region server 2 stoped at about 19:36:
>> 2010-09-12 19:37:01,104 INFO org.apache.hadoop.hbase.master.ServerManager: 1
>> region servers, 1 dead, average load 356.0[md-prod01,60020,1284169861364]
>>
>> the last logs of this regionserver before it stoped is:
>> 2010-09-12 19:36:02,398 INFO org.apache.hadoop.hbase.regionserver.Store:
>> Completed compaction of 3 file(s) in visitors of xxx.; new storefile is
>> hdfs://xxx; store size is 201.3m
>> 2010-09-12 19:36:02,398 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> compaction completed on region xxx. in 9sec
>> 2010-09-12 19:36:04,604 DEBUG
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
>> started. Attempting to free 20845272 bytes
>> 2010-09-12 19:36:04,609 DEBUG
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
>> completed. Freed 19920968 bytes. Priority Sizes: Single=43.940674MB
>> (46075136), Multi=74.515015MB (78134656),Memory=49.535378MB (51941608)
>>
