Hello,
I have two problems that may or may not be related.
The first is trying to figure out a self-correcting outage I had last evening.
I noticed issues starting with clients reporting:
RetriesExhaustedException: Trying to contact region server Some server...
I didn't see much going on in the regionserver logs, except for some major
compactions. Eventually I decided to check the status of the table being
written to, and it was disabled - and not by me (AFAIK).
I tried enabling the table via the hbase shell... it was taking a long time,
so I left for the evening. When I came back this morning, the shell had
reported:
hbase(main):002:0> enable 'filestore'
NativeException: java.io.IOException: Unable to enable table filestore
Except that by now, the table was back up!
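For anyone following along, the checking I did boils down to something like
this in the hbase shell (table name is from my setup; exact describe output
varies by version, and enable blocks until the regions are online):

```
hbase(main):001:0> describe 'filestore'   # dumps the schema; newer shells also report enabled/disabled state
hbase(main):002:0> enable 'filestore'     # blocks while the master brings each region online
```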
After going through the logs a little more closely, the only thing I can find
that seems correlated (at least by timing) is the following:
(in the HBase master logs, on the namenode host)
2010-07-28 18:39:17,213 INFO org.apache.hadoop.hbase.master.ServerManager:
Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS:
filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1279721711873:
Daughters;
filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1280367555232,
filestore,40d8647ad2222e18901071d36124fa3f310970776028ecec7a94d57df10dba86,1280367555232
from ubuntu-hadoop-3,60020,1280263369525; 1 of 1
...
2010-07-28 18:42:45,835 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
filestore,fa8a881cb23eb9e41b197d305275440453e9967e4e6cf53024d478ab984f3392,1280347781550/1176636191
no longer has references to
filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171
2010-07-28 18:42:45,842 INFO org.apache.hadoop.hbase.master.BaseScanner:
Deleting region
filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171
(encoded=1245524105) because daughter splits no longer hold references
...
2010-07-28 18:59:39,000 DEBUG org.apache.hadoop.hbase.master.ChangeTableState:
Processing unserved regions
2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState:
Skipping region REGION => {NAME =>
'filestore,201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e,1279613059169',
STARTKEY =>
'201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e', ENDKEY =>
'202ad98d24575f6782c9e9836834b77e0f5ddac0b1efa3cd21ac590482edf3e1', ENCODED =>
1808201339, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'filestore',
FAMILIES => [{NAME => 'content', COMPRESSION => 'LZO', VERSIONS => '3', TTL =>
'2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE =>
'true'}]}} because it is offline and split
...
2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState:
Processing regions currently being served
2010-07-28 18:59:39,002 DEBUG org.apache.hadoop.hbase.master.ChangeTableState:
Already online
...
2010-07-28 19:00:34,485 INFO org.apache.hadoop.hbase.master.ServerManager: 4
region servers, 0 dead, average load 1060.0
2010-07-28 19:00:49,850 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 192.168.193.67:60020,
regionname: -ROOT-,,0, startKey: <>}
2010-07-28 19:00:49,858 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
192.168.193.67:60020, regionname: -ROOT-,,0, startKey: <>} complete
2010-07-28 19:01:06,981 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1280348069422/1455931173
no longer has references to
filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1279713176956
...
I'm not sure whether they're relevant, but I saw these messages toward the end:
...
2010-07-28 19:18:31,029 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
filestore,6541cf3f415214d56b0b385d11516b97e18fa2b25da141b770a8a9e0bfe60b52,1280359412067/1522934061
no longer has references to
filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326
2010-07-28 19:18:31,061 INFO org.apache.hadoop.hbase.master.BaseScanner:
Deleting region
filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326
(encoded=597566178) because daughter splits no longer hold references
2010-07-28 19:18:31,061 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
DELETING region hdfs://ubuntu-namenode:54310/hbase/filestore/597566178
...
These may correspond to the time when the table was recovering (if so, I just
missed it coming back online).
...
As a final note, I re-ran some of the clients today, and it appears some are
OK, and some consistently give:
Error: io exception when loading file:
/tmp/archive_transfer/AVT100727-0/803de9924dc8f2d6.bi
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
region server Some server, retryOnlyOne=true, index=0, islastrow=false,
tries=9, numtries=10, i=3, listsize=7,
region=filestore,d00ca2d087bdeeb4ee57225a41e19de9dd07e4d9b03be99298046644f9c9e354,1279599904220
for region
filestore,bdfa9f2173033330cfae81ece08f75f0002bf3f3a54cde6bbf9192f0187e275b,1279604506836,
row 'be29a0028bab2149a6c4f990e99c4e7c1c5be0656594738bbe87e7bf0fcda57f', but
failed after 10 attempts
So while the above is the error that brought the offline table to my attention,
it may just be a separate bug?
Not sure what causes it, but since it happens consistently in a program run
with one set of arguments but not another, I'm thinking it's an error on my
part.
Any ideas on what could cause the table to go offline?
Any common mistakes that lead to RetriesExhausted errors?
The retry errors occur in a shared method that uploads a file to the
filestore, so I'm not sure why it fails in one case but not another.
Maybe it's just the size of the file (~300K)?
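For what it's worth, the tries=9, numtries=10 counters in the exception above
suggest the client exhausted its retry budget against a moving region. The
behavior can be mimicked with a simple backoff loop; this is a hedged sketch,
not HBase API (upload_with_retries and put_fn are made-up names):

```python
import time

def upload_with_retries(put_fn, row, max_tries=10, base_sleep=0.1):
    """Retry a flaky operation, roughly like the client's numtries/tries
    counters above. put_fn is any callable that raises IOError on a
    transient failure; the last failure is re-raised as RuntimeError."""
    for attempt in range(max_tries):
        try:
            return put_fn(row)
        except IOError as e:
            if attempt == max_tries - 1:
                raise RuntimeError(
                    "failed after %d attempts: %s" % (max_tries, e))
            # exponential backoff between attempts
            time.sleep(base_sleep * (2 ** attempt))
```

If the real client gives up like this while a region is mid-split, widening
the backoff (or raising the retry count) may ride out the reassignment.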
Thanks!
Take care,
-stu