Hello,
I have two problems that may or may not be related.
The first is trying to figure out a self-correcting outage I had last evening.
I noticed issues starting with clients reporting:
RetriesExhaustedException: Trying to contact region server Some server...
I didn't see much going on in the regionserver logs, except for some major
compactions. Eventually I decided to check the status of the table being
written to, and it was disabled - and not by me (AFAIK).
I tried enabling the table via the hbase shell... it was taking a long time,
so I left for the evening. When I came back this morning, the shell had
reported:
hbase(main):002:0> enable 'filestore'
NativeException: java.io.IOException: Unable to enable table filestore
Except that by now, the table was back up!
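For anyone following along, the checking I did boils down to something like
this in the hbase shell (table name is from my setup; exact describe output
varies by version, and enable blocks until the regions are online):

```
hbase(main):001:0> describe 'filestore'   # dumps the schema; newer shells also report enabled/disabled state
hbase(main):002:0> enable 'filestore'     # blocks while the master brings each region online
```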
After going through the logs a little more closely, the only thing I can find
that seems correlated (at least by timing) is the following:
(in the HBase master logs, on the namenode host)
2010-07-28 18:39:17,213 INFO org.apache.hadoop.hbase.master.ServerManager:
Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS:
filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1279721711873:
Daughters;
filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1280367555232,
filestore,40d8647ad2222e18901071d36124fa3f310970776028ecec7a94d57df10dba86,1280367555232
from ubuntu-hadoop-3,60020,1280263369525; 1 of 1
...
2010-07-28 18:42:45,835 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
filestore,fa8a881cb23eb9e41b197d305275440453e9967e4e6cf53024d478ab984f3392,1280347781550/1176636191
no longer has references to
filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171
2010-07-28 18:42:45,842 INFO org.apache.hadoop.hbase.master.BaseScanner:
Deleting region
filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171
(encoded=1245524105) because daughter splits no longer hold references
...
2010-07-28 18:59:39,000 DEBUG org.apache.hadoop.hbase.master.ChangeTableState:
Processing unserved regions
2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState:
Skipping region REGION => {NAME =>
'filestore,201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e,1279613059169',
STARTKEY =>
'201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e', ENDKEY =>
'202ad98d24575f6782c9e9836834b77e0f5ddac0b1efa3cd21ac590482edf3e1', ENCODED =>
1808201339, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'filestore',
FAMILIES => [{NAME => 'content', COMPRESSION => 'LZO', VERSIONS => '3', TTL =>
'2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE =>
'true'}]}} because it is offline and split
...
2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState:
Processing regions currently being served
2010-07-28 18:59:39,002 DEBUG org.apache.hadoop.hbase.master.ChangeTableState:
Already online
...
2010-07-28 19:00:34,485 INFO org.apache.hadoop.hbase.master.ServerManager: 4
region servers, 0 dead, average load 1060.0
2010-07-28 19:00:49,850 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 192.168.193.67:60020,
regionname: -ROOT-,,0, startKey: <>}
2010-07-28 19:00:49,858 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
192.168.193.67:60020, regionname: -ROOT-,,0, startKey: <>} complete
2010-07-28 19:01:06,981 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1280348069422/1455931173
no longer has references to
filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1279713176956
...
I'm not sure whether they're relevant, but I saw these messages toward the end:
...
2010-07-28 19:18:31,029 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
filestore,6541cf3f415214d56b0b385d11516b97e18fa2b25da141b770a8a9e0bfe60b52,1280359412067/1522934061
no longer has references to
filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326
2010-07-28 19:18:31,061 INFO org.apache.hadoop.hbase.master.BaseScanner:
Deleting region
filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326
(encoded=597566178) because daughter splits no longer hold references
2010-07-28 19:18:31,061 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
DELETING region hdfs://ubuntu-namenode:54310/hbase/filestore/597566178
...
These may correspond to the time when the table was recovering (if so, I just
missed it coming back online).
...
As a final note, I re-ran some of the clients today, and it appears some are
OK, and some consistently give:
Error: io exception when loading file:
/tmp/archive_transfer/AVT100727-0/803de9924dc8f2d6.bi
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
region server Some server, retryOnlyOne=true, index=0, islastrow=false,
tries=9, numtries=10, i=3, listsize=7,
region=filestore,d00ca2d087bdeeb4ee57225a41e19de9dd07e4d9b03be99298046644f9c9e354,1279599904220
for region
filestore,bdfa9f2173033330cfae81ece08f75f0002bf3f3a54cde6bbf9192f0187e275b,1279604506836,
row 'be29a0028bab2149a6c4f990e99c4e7c1c5be0656594738bbe87e7bf0fcda57f', but
failed after 10 attempts
So while the above is the error that brought the offline table to my attention,
it may just be a separate bug?
Not sure what causes it, but since it happens consistently in a program run
with one set of arguments but not another, I'm thinking it's an error on my
part.
Any ideas on what could cause the table to go offline?
Any common mistakes that lead to RetriesExhausted errors?
The retry errors occur in a shared method that uploads a file to the
filestore, so I'm not sure why it fails in one case but not another.
Maybe it's just the size of the file (~300K)?
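For what it's worth, the tries=9, numtries=10 counters in the exception above
suggest the client exhausted its retry budget against a moving region. The
behavior can be mimicked with a simple backoff loop; this is a hedged sketch,
not HBase API (upload_with_retries and put_fn are made-up names):

```python
import time

def upload_with_retries(put_fn, row, max_tries=10, base_sleep=0.1):
    """Retry a flaky operation, roughly like the client's numtries/tries
    counters above. put_fn is any callable that raises IOError on a
    transient failure; the last failure is re-raised as RuntimeError."""
    for attempt in range(max_tries):
        try:
            return put_fn(row)
        except IOError as e:
            if attempt == max_tries - 1:
                raise RuntimeError(
                    "failed after %d attempts: %s" % (max_tries, e))
            # exponential backoff between attempts
            time.sleep(base_sleep * (2 ** attempt))
```

If the real client gives up like this while a region is mid-split, widening
the backoff (or raising the retry count) may ride out the reassignment.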
Thanks!
Take care,
-stu