I've dealt with dozens of spontaneous shutdowns in recent weeks. (We call them 
Region Server Suicides.)

The files problem is where the OS (i.e., Linux) limits the number of files a 
user can have open at one time.  A common default of 1024 isn't enough for 
HBase.  Based purely on empirical evidence, if the open-file limit for the 
hbase user had been your problem, you would have failed well before 100 
million rows.  You can also run into the same limit for the hadoop user, but 
the issue shows up earlier for the hbase user.  It is probably best to raise 
them both at the same time.
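
For reference, here is roughly what I add on my boxes; the exact numbers are 
just what has worked for me, not an official recommendation.  In 
/etc/security/limits.conf:

    hbase    -    nofile    32768
    hadoop   -    nofile    32768

Log back in and verify with "ulimit -n" as each user.  On some distros you 
also have to make sure pam_limits is enabled (a "session required 
pam_limits.so" line under /etc/pam.d/) or limits.conf is silently ignored.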

While not having enough file handles is definitely a gotcha, there are a few 
other things to look out for as well.

Debugging: one misleading aspect of tracking down this kind of problem is that 
most of the messages you see when it happens are actually side effects of 
something that happened earlier.  You've probably realized this already, since 
you've been searching back over a long stretch of your logs.

Other things to consider: 

* The most common reason I've had for Region Server Suicide is ZooKeeper: the 
region server decides ZooKeeper is unreachable and kills itself.  I thought 
this had to do with heavy load, but it also happens for me when nothing is 
running, and I haven't been able to find a quantifiable cause.  It's just a 
weakness in the HBase-ZooKeeper dependency; higher loads exacerbate the 
problem, but are not required for a suicide event to occur.  (See the config 
sketch after this list for the session timeout knob.)

* Another reason is the HDFS dependency: if a file is temporarily unavailable 
for any reason, HBase handles the situation with Region Server Suicide.  (The 
same sketch below touches on the datanode xcievers setting from the 
requirements page J-D links.)
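
If you want to experiment with the knobs I've poked at, here is a sketch.  The 
values are only illustrative, not recommendations, and they treat symptoms 
rather than causes (long GC pauses are often what actually makes a region 
server miss its ZooKeeper heartbeats).

In hbase-site.xml, a longer ZooKeeper session timeout:

    <property>
      <name>zookeeper.session.timeout</name>
      <value>120000</value>  <!-- milliseconds; longer than the default -->
    </property>

In hdfs-site.xml on the datanodes, a higher transceiver limit (yes, the 
property name really is spelled that way):

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>

Both require restarting the affected daemons to take effect.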

HBase is a powerful tool that lets us do more with less, but it is currently 
somewhat brittle with respect to its dependencies; suicide is the standard 
response to any hiccup with them.  Hopefully the response will become less 
"final" as HBase becomes more robust.  Perhaps if there were a setting that 
controlled whether a region server is allowed to commit suicide, some of us 
would feel more comfortable with the idea.

In the meantime, you can try to work around these issues by using bigger 
hardware than you would otherwise think necessary and by not letting the load 
get very high.  For example, I tend to hit these kinds of problems much less 
often when the load on any individual machine never goes above the number of 
cores.
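
As a quick sanity check (nothing HBase-specific, just standard Linux tools):

    grep -c ^processor /proc/cpuinfo   # number of cores
    uptime                             # last three numbers are the 1/5/15-minute load averages

If the load averages regularly exceed the core count, I expect trouble.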

I also recommend sticking to the latest version available. 

FYI, 

Matthew


On Sep 13, 2010, at 7:20 PM, Jean-Daniel Cryans wrote:

> Can we see the actual lines from when it died, with a lot of context,
> and please put them in a pastebin.com
> 
> Also, most of the time users get this kind of error because they
> didn't configure HBase and Hadoop properly, mostly the last
> requirement: 
> http://hbase.apache.org/docs/r0.20.6/api/overview-summary.html#requirements
> 
> J-D
> 
> On Mon, Sep 13, 2010 at 7:08 PM, ZhouShuaifeng 00100568
> <[email protected]> wrote:
>> Hi All,
>> 
>> I encountered a problem when doing a data-put test on HBase. Please help. 
>> Thanks a lot.
>> 
>> After putting millions of rows, 2 of the 3 region servers stopped.
>> Server 1 stopped when putting about 50 million rows.
>> Server 2 stopped when putting about 100 million rows.
>> 
>> Some exceptions were thrown.
>> The client exception info is below:
>> org.apache.hadoop.hbase.client.NoServerForRegionException: No server address 
>> listed in .META. for region xxx.
>>        at 
>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:833)
>>        at 
>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:677)
>>        at 
>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfPuts(HConnectionManager.java:1419)
>>        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:664)
>>        at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:549)
>>        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:535)
>> 
>> The server exception is below:
>> org.apache.hadoop.hbase.NotServingRegionException: xxx. is closed
>>        at 
>> org.apache.hadoop.hbase.regionserver.HRegion.internalObtainRowLock(HRegion.java:2122)
>>        at 
>> org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:2211)
>>        at 
>> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchPut(HRegion.java:1493)
>>        at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1447)
>>        at 
>> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1703)
>>        at 
>> org.apache.hadoop.hbase.regionserver.HRegionServer.multiPut(HRegionServer.java:2361)
>>        at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>>        at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
>>        at 
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
>> 
>> Region server 1 stopped at about 13:59:
>> 2010-09-12 13:59:01,017 INFO org.apache.hadoop.hbase.master.ServerManager: 2 
>> region servers, 1 dead, average load 99.5[md-prod04,60020,1284169880500]
>> 
>> The last 2 log lines of this region server before it stopped are:
>> 2010-09-12 13:57:46,170 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
>> Caches flushed, doing commit now (which includes update scanners)
>> 2010-09-12 13:57:46,172 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
>> Finished memstore flush of ~21.0m for region 
>> percontent_hr,2000-01-03#http#001#url18#s#states44#0,1284322397591.8a557b61c9eb4b117368051b98e8d1d1.
>>  in 290ms, sequence id=155719842, compaction requested=false
>> 
>> Region server 2 stopped at about 19:36:
>> 2010-09-12 19:37:01,104 INFO org.apache.hadoop.hbase.master.ServerManager: 1 
>> region servers, 1 dead, average load 356.0[md-prod01,60020,1284169861364]
>> 
>> The last log lines of this region server before it stopped are:
>> 2010-09-12 19:36:02,398 INFO org.apache.hadoop.hbase.regionserver.Store: 
>> Completed compaction of 3 file(s) in visitors of xxx.; new storefile is 
>> hdfs://xxx; store size is 201.3m
>> 2010-09-12 19:36:02,398 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
>> compaction completed on region xxx. in 9sec
>> 2010-09-12 19:36:04,604 DEBUG 
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction 
>> started.  Attempting to free 20845272 bytes
>> 2010-09-12 19:36:04,609 DEBUG 
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction 
>> completed. Freed 19920968 bytes.  Priority Sizes: Single=43.940674MB 
>> (46075136), Multi=74.515015MB (78134656),Memory=49.535378MB (51941608)
>> 
>> 
