To follow up on the retry error (still have no idea about the table going offline):
It was a coding error, sorta kinda. I was doing large batches with AutoFlush disabled and flushing at the end, figuring I could gain performance and just reprocess bad batches. Bad call. It appears I was consistently getting errors on flush, so the batch just kept failing. Now I flush after every successful file upload, and only one or two out of a couple thousand fail, and not consistently on one file, so retries are possible.

I also added a 3 second sleep when I get some kind of IOException executing a Put on this particular table, to prevent some sort of cascade effect. That part is going pretty smoothly now. (A rough sketch of the per-file loop is at the bottom, below the quoted message.)

Still don't know about the offline table thing - crossing my fingers and watching closely for now (and adding nodes).

I guess the moral of the first lesson is to really treat individual Puts as somewhat unreliable?

Take care,
-stu

--- On Thu, 7/29/10, Stuart Smith <[email protected]> wrote:

> From: Stuart Smith <[email protected]>
> Subject: Table goes offline - temporary outage + Retries Exhausted (related?)
> To: [email protected]
> Date: Thursday, July 29, 2010, 2:09 PM
> Hello,
> I have two problems that may or may not
> be related.
>
> One is trying to figure out a self-correcting outage I had
> last evening.
>
> I noticed issues starting with clients reporting:
>
> RetriesExhaustedException: Trying to contact region server
> Some server...
>
> I didn't see much going on in the regionserver logs, except
> for some major compactions. Eventually I decided to check
> the status of the table being written to, and it was
> disabled - and not by me (AFAIK).
>
> I tried enabling the table via the hbase shell.. and it was
> taking a long time, so I left for the evening. I came
> back this morning, and the shell had reported:
>
> hbase(main):002:0> enable 'filestore'
> NativeException: java.io.IOException: Unable to enable
> table filestore
>
> Except by now, the table was back up!
>
> After going through the logs a little more closely, the
> only thing I can find that seems correlated (at least by the
> timing):
>
> (in the namenode logs)
>
> 2010-07-28 18:39:17,213 INFO
> org.apache.hadoop.hbase.master.ServerManager: Processing
> MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS:
> filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1279721711873:
> Daughters;
> filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1280367555232,
> filestore,40d8647ad2222e18901071d36124fa3f310970776028ecec7a94d57df10dba86,1280367555232
> from ubuntu-hadoop-3,60020,1280263369525; 1 of 1
>
> ...
>
> 2010-07-28 18:42:45,835 DEBUG
> org.apache.hadoop.hbase.master.BaseScanner:
> filestore,fa8a881cb23eb9e41b197d305275440453e9967e4e6cf53024d478ab984f3392,1280347781550/1176636191
> no longer has references to
> filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171
> 2010-07-28 18:42:45,842 INFO
> org.apache.hadoop.hbase.master.BaseScanner: Deleting region
> filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171
> (encoded=1245524105) because daughter splits no longer hold
> references
> ...
> 2010-07-28 18:59:39,000 DEBUG
> org.apache.hadoop.hbase.master.ChangeTableState: Processing
> unserved regions
> 2010-07-28 18:59:39,001 DEBUG
> org.apache.hadoop.hbase.master.ChangeTableState: Skipping
> region REGION => {NAME =>
> 'filestore,201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e,1279613059169',
> STARTKEY =>
> '201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e',
> ENDKEY =>
> '202ad98d24575f6782c9e9836834b77e0f5ddac0b1efa3cd21ac590482edf3e1',
> ENCODED => 1808201339, OFFLINE => true, SPLIT =>
> true, TABLE => {{NAME => 'filestore', FAMILIES =>
> [{NAME => 'content', COMPRESSION => 'LZO', VERSIONS
> => '3', TTL => '2147483647', BLOCKSIZE => '65536',
> IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} because
> it is offline and split
> ...
> 2010-07-28 18:59:39,001 DEBUG
> org.apache.hadoop.hbase.master.ChangeTableState: Processing
> regions currently being served
> 2010-07-28 18:59:39,002 DEBUG
> org.apache.hadoop.hbase.master.ChangeTableState: Already
> online
>
> ...
> 2010-07-28 19:00:34,485 INFO
> org.apache.hadoop.hbase.master.ServerManager: 4 region
> servers, 0 dead, average load 1060.0
> 2010-07-28 19:00:49,850 INFO
> org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scanning meta region {server:
> 192.168.193.67:60020, regionname: -ROOT-,,0, startKey:
> <>}
> 2010-07-28 19:00:49,858 INFO
> org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scan of 1 row(s) of meta region
> {server: 192.168.193.67:60020, regionname: -ROOT-,,0,
> startKey: <>} complete
> 2010-07-28 19:01:06,981 DEBUG
> org.apache.hadoop.hbase.master.BaseScanner:
> filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1280348069422/1455931173
> no longer has references to
> filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1279713176956
> ...
> I'm not really sure, but I saw these messages toward the
> end:
> ...
> 2010-07-28 19:18:31,029 DEBUG
> org.apache.hadoop.hbase.master.BaseScanner:
> filestore,6541cf3f415214d56b0b385d11516b97e18fa2b25da141b770a8a9e0bfe60b52,1280359412067/1522934061
> no longer has references to
> filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326
> 2010-07-28 19:18:31,061 INFO
> org.apache.hadoop.hbase.master.BaseScanner: Deleting region
> filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326
> (encoded=597566178) because daughter splits no longer hold
> references
> 2010-07-28 19:18:31,061 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: DELETING
> region
> hdfs://ubuntu-namenode:54310/hbase/filestore/597566178
> ...
> Which may correspond to the time when it was recovering (if
> so, I just missed it coming back online).
> ...
>
> As a final note, I re-ran some of the clients today, and it
> appears some are OK, and some consistently give:
>
> Error: io exception when loading file:
> /tmp/archive_transfer/AVT100727-0/803de9924dc8f2d6.bi
> org.apache.hadoop.hbase.client.RetriesExhaustedException:
> Trying to contact region server Some server,
> retryOnlyOne=true, index=0, islastrow=false, tries=9,
> numtries=10, i=3, listsize=7,
> region=filestore,d00ca2d087bdeeb4ee57225a41e19de9dd07e4d9b03be99298046644f9c9e354,1279599904220
> for region
> filestore,bdfa9f2173033330cfae81ece08f75f0002bf3f3a54cde6bbf9192f0187e275b,1279604506836,
> row
> 'be29a0028bab2149a6c4f990e99c4e7c1c5be0656594738bbe87e7bf0fcda57f',
> but failed after 10 attempts
>
> So while the above is the error that brought the offline
> table to my attention - it may just be a separate bug?
>
> Not sure what causes it, but since it happens consistently
> in a program being run with one set of arguments, but not
> another, I'm thinking it's an error on my part.
>
> Any ideas on what could cause the table to go offline?
> Any common mistakes that lead to RetriesExhausted errors?
>
> The Retry errors occurred in a shared method that uploads a
> file to the filestore, so I'm not sure what causes it to
> fail in one case, but not another. Maybe just the size of
> the file? (@300K).
>
> Thanks!
>
> Take care,
> -stu
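
P.S. In case anyone hits the same thing, this is roughly the shape of the per-file upload loop now. It's just a sketch, not my actual code: the class/method names, the "data" qualifier, the row-key handling, and the retry count are made up for illustration, and it assumes the HTable still has setAutoFlush(false) set wherever it gets created. The 'filestore' table and 'content' family are the real ones.

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FilestoreUploader {

        private static final byte[] CONTENT_FAMILY = Bytes.toBytes("content");
        private static final int MAX_ATTEMPTS = 3;     // illustrative
        private static final long BACKOFF_MS = 3000L;  // the 3 second sleep mentioned above

        // Stores one file as a single Put and flushes immediately, so a bad
        // flush only costs this file instead of a whole buffered batch.
        public static void storeFile(HTable table, byte[] rowKey, byte[] fileBytes)
                throws IOException, InterruptedException {
            for (int attempt = 1; ; attempt++) {
                try {
                    Put put = new Put(rowKey);
                    put.add(CONTENT_FAMILY, Bytes.toBytes("data"), fileBytes);
                    table.put(put);
                    table.flushCommits();  // flush per file, not at the end of the batch
                    return;
                } catch (IOException e) {  // RetriesExhaustedException is an IOException
                    if (attempt >= MAX_ATTEMPTS) {
                        throw e;           // give up; the caller can reprocess just this file
                    }
                    Thread.sleep(BACKOFF_MS);  // back off so one bad Put/flush doesn't cascade
                }
            }
        }
    }

If the flush blows up on one file, only that file gets retried or reprocessed, and the sleep gives the region a moment to settle before the next attempt.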
