Re: Table goes offline - temporary outage + Retries Exhausted (related?)

Stuart Smith Thu, 29 Jul 2010 15:03:08 -0700

Hello all,

It looks like I had an ensemble of unrelated errors.


To follow up with the table going offline error:

I noticed today that the gui will say: "Enabled False", and the shell will say:

hbase(main):004:0> describe 'filestore'
DESCRIPTION                                                             ENABLED
 {NAME => 'filestore', FAMILIES => [{NAME => 'content', COMPRESSION =>  false
 'LZO', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_

Soo... I'm not sure which is which - maybe it was never disabled, depending on 
whether the gui or shell is correct. It appears to be the shell, since I've 
been uploading more data, and it's going through fine now.

I'm guessing yesterday's uploads were failing due to the batch issues, and the 
gui reported the table as disabled, and I connected the two issues incorrectly.
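
A minimal sketch of the pattern described in the quoted message below: store
one file at a time, flush after each one, and pause before retrying when an
IOException comes back. All names here (RetryingUploader, FileUpload,
putAndFlush, uploadAll) are hypothetical and are not taken from the original
program. putAndFlush stands in for whatever issues the Put for one file and
then flushes the client write buffer (in the HBase client of that era,
presumably HTable.put followed by HTable.flushCommits with AutoFlush disabled).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class RetryingUploader {

    /** Hypothetical stand-in for "Put one file, then flush the write buffer". */
    @FunctionalInterface
    public interface FileUpload {
        void putAndFlush(String path) throws IOException;
    }

    /**
     * Uploads each path on its own, so a failed flush affects one file only.
     * Returns the paths that still failed after maxAttempts tries.
     */
    static List<String> uploadAll(List<String> paths, FileUpload upload,
                                  int maxAttempts, long pauseMillis) {
        List<String> failed = new ArrayList<>();
        for (String path : paths) {
            boolean stored = false;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    upload.putAndFlush(path);
                    stored = true;
                    break;
                } catch (IOException e) {
                    // Give up on this file if attempts are used up or the
                    // pause was interrupted; otherwise wait and try again.
                    if (attempt == maxAttempts || !pause(pauseMillis)) {
                        break;
                    }
                }
            }
            if (!stored) {
                failed.add(path);
            }
        }
        return failed;
    }

    /** Sleeps for the given time; returns false if the thread was interrupted. */
    private static boolean pause(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Calling uploadAll with pauseMillis = 3000 would roughly match the 3 second
sleep mentioned in the quoted message. The attempt limit is an assumption,
since the original does not say how many retries were made per file.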

Take care,
  -stu

--- On Thu, 7/29/10, Stuart Smith <[email protected]> wrote:

> From: Stuart Smith <[email protected]>
> Subject: Re: Table goes offline - temporary outage + Retries Exhausted 
> (related?)
> To: [email protected]
> Date: Thursday, July 29, 2010, 3:19 PM
> To follow up on the retry error
> (still have no idea about the table going offline):
> 
> It was a coding error, sorta kinda.
> 
> I was doing large batches with AutoFlush disabled, and
> flushing at the end, figuring I could gain performance, and
> just reprocess bad batches.
> 
> Bad call.
> 
> It appears I was consistently getting errors on flush, so
> the batch just kept failing. Now I flush after every
> successful file upload, and only one or two out of a couple
> thousand fail, and not consistently on one file, so retries
> are possible.
> 
> I also added a 3 second sleep when I get some kind of
> IOException executing a Put on this particular table, to
> prevent some sort of cascade effect.
> 
> That part is going pretty smooth now.
> 
> Still don't know about the offline table thing - crossing
> my fingers and watching closely for now (and adding nodes).
> 
> I guess the moral of the first lesson is to really treat
> the Puts() as somewhat unreliable?
> 
> Take care,
>   -stu
> 
> 
> 
> --- On Thu, 7/29/10, Stuart Smith <[email protected]> wrote:
> 
> > From: Stuart Smith <[email protected]>
> > Subject: Table goes offline - temporary outage + Retries Exhausted (related?)
> > To: [email protected]
> > Date: Thursday, July 29, 2010, 2:09 PM
> > Hello,
> >    I have two problems that may or may not be related.
> > 
> > One is trying to figure out a self-correcting outage I had
> > last evening.
> > 
> > I noticed issues starting with clients reporting:
> > 
> > RetriesExhaustedException: Trying to contact region server
> > Some server...
> > 
> > I didn't see much going on in the regionserver logs, except
> > for some major compactions. Eventually I decided to check
> > the status of the table being written to, and it was
> > disabled - and not by me (AFAIK).
> > 
> > I tried enabling the table via the hbase shell.. and it was
> > taking a long time, so I left for the evening. I came
> > back this morning, and the shell had reported:
> > 
> > hbase(main):002:0> enable 'filestore'
> > NativeException: java.io.IOException: Unable to enable
> > table filestore
> > 
> > Except by now, the table was back up!
> > 
> > After going through the logs a little more closely, the
> > only thing I can find that seems correlated (at least by the
> > timing):
> > 
> > (in the namenode logs)
> > 
> > 2010-07-28 18:39:17,213 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1279721711873: Daughters; filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1280367555232, filestore,40d8647ad2222e18901071d36124fa3f310970776028ecec7a94d57df10dba86,1280367555232 from ubuntu-hadoop-3,60020,1280263369525; 1 of 1
> > 
> > ...
> > 
> > 2010-07-28 18:42:45,835 DEBUG org.apache.hadoop.hbase.master.BaseScanner: filestore,fa8a881cb23eb9e41b197d305275440453e9967e4e6cf53024d478ab984f3392,1280347781550/1176636191 no longer has references to filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171
> > 2010-07-28 18:42:45,842 INFO org.apache.hadoop.hbase.master.BaseScanner: Deleting region filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171 (encoded=1245524105) because daughter splits no longer hold references
> > ...
> > 2010-07-28 18:59:39,000 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Processing unserved regions
> > 2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Skipping region REGION => {NAME => 'filestore,201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e,1279613059169', STARTKEY => '201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e', ENDKEY => '202ad98d24575f6782c9e9836834b77e0f5ddac0b1efa3cd21ac590482edf3e1', ENCODED => 1808201339, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'filestore', FAMILIES => [{NAME => 'content', COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} because it is offline and split
> > ...
> > 2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Processing regions currently being served
> > 2010-07-28 18:59:39,002 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Already online
> > 
> > ...
> > 2010-07-28 19:00:34,485 INFO org.apache.hadoop.hbase.master.ServerManager: 4 region servers, 0 dead, average load 1060.0
> > 2010-07-28 19:00:49,850 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.193.67:60020, regionname: -ROOT-,,0, startKey: <>}
> > 2010-07-28 19:00:49,858 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.193.67:60020, regionname: -ROOT-,,0, startKey: <>} complete
> > 2010-07-28 19:01:06,981 DEBUG org.apache.hadoop.hbase.master.BaseScanner: filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1280348069422/1455931173 no longer has references to filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1279713176956
> > ...
> > I'm not really sure, but I saw these messages toward the end:
> > ...
> > 2010-07-28 19:18:31,029 DEBUG org.apache.hadoop.hbase.master.BaseScanner: filestore,6541cf3f415214d56b0b385d11516b97e18fa2b25da141b770a8a9e0bfe60b52,1280359412067/1522934061 no longer has references to filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326
> > 2010-07-28 19:18:31,061 INFO org.apache.hadoop.hbase.master.BaseScanner: Deleting region filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326 (encoded=597566178) because daughter splits no longer hold references
> > 2010-07-28 19:18:31,061 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: DELETING region hdfs://ubuntu-namenode:54310/hbase/filestore/597566178
> > ...
> > Which may correspond to the time when it was recovering (if
> > so, I just missed it coming back online).
> > ...
> > 
> > As a final note, I re-ran some of the clients today, and it
> > appears some are OK, and some consistently give:
> > 
> > Error: io exception when loading file:
> > /tmp/archive_transfer/AVT100727-0/803de9924dc8f2d6.bi
> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=false, tries=9, numtries=10, i=3, listsize=7, region=filestore,d00ca2d087bdeeb4ee57225a41e19de9dd07e4d9b03be99298046644f9c9e354,1279599904220 for region filestore,bdfa9f2173033330cfae81ece08f75f0002bf3f3a54cde6bbf9192f0187e275b,1279604506836, row 'be29a0028bab2149a6c4f990e99c4e7c1c5be0656594738bbe87e7bf0fcda57f', but failed after 10 attempts
> > 
> > So while the above is the error that brought the offline
> > table to my attention - it may just be a separate bug?
> > 
> > Not sure what causes it, but since it happens consistently
> > in a program being run with one set of arguments, but not
> > another, I'm thinking it's an error on my part.
> > 
> > Any ideas on what could cause the table to go offline?
> > Any common mistakes that lead to RetriesExhausted errors?
> > 
> > The Retry errors occurred in a shared method that uploads a
> > file to the filestore, so I'm not sure what causes it to
> > fail in one case, but not another. Maybe just the size of
> > the file? (@300K).
> > 
> > Thanks!
> > 
> > Take care,
> >   -stu
