Hi,

There is a lot going on in this email. The logs might look promising,
but they are standard split messages, not really indicative of
anything going wrong.

It sounds like you might be coming across some of the standard foils
that are well documented here:
http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#overview_description

Could you confirm that you have things like the xceiver count and
ulimits set? I use the following on all my clusters; maybe you can
try it as well:
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>
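
On the xceiver and ulimit side, a typical hdfs-site.xml entry looks
something like this (note the property name really is spelled
"xcievers"; 2047 is just a common starting value, tune it for your
cluster):

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>2047</value>
</property>

The ulimit is usually raised in /etc/security/limits.conf for the
user running the datanode and regionserver - something like a nofile
limit of 32768.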

Lastly, I don't think Put should be treated as unreliable. I have
reliably imported tens of billions of rows, so there is something
else going on.

-ryan
PS: MySQL DBAs spend tons of time setting up ulimits and other
esoteric kernel tuning parameters; our requirements are actually
surprisingly low in that regard.

On Thu, Jul 29, 2010 at 3:02 PM, Stuart Smith <[email protected]> wrote:
> Hello all,
>
> It looks like I had an ensemble of unrelated errors.
>
> To follow up on the table going offline error:
>
> I noticed today that the GUI says "Enabled False", and the shell says:
>
> hbase(main):004:0> describe 'filestore'
> DESCRIPTION                                                             
> ENABLED
>  {NAME => 'filestore', FAMILIES => [{NAME => 'content', COMPRESSION =>  false
>  'LZO', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_
>
> So I'm not sure which to believe - maybe it was never disabled,
> depending on whether the GUI or the shell is correct. It appears to
> be the shell, since I've been uploading more data, and it's going
> through fine now.
>
> I'm guessing yesterday's uploads were failing due to the batch
> issues, the GUI reported the table as disabled, and I incorrectly
> connected the two issues.
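>
> For what it's worth, one more way to cross-check the table state is
> from the Java client. This is just a minimal sketch against the
> 0.20-era API (it assumes hbase-site.xml is on the classpath):
>
> import java.io.IOException;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.HBaseAdmin;
>
> public class CheckTableState {
>   public static void main(String[] args) throws IOException {
>     // Picks up hbase-site.xml from the classpath
>     HBaseConfiguration conf = new HBaseConfiguration();
>     HBaseAdmin admin = new HBaseAdmin(conf);
>     // Asks the master directly, bypassing both the GUI and the shell
>     System.out.println("filestore enabled: "
>         + admin.isTableEnabled("filestore"));
>   }
> }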
>
> Take care,
>  -stu
>
> --- On Thu, 7/29/10, Stuart Smith <[email protected]> wrote:
>
>> From: Stuart Smith <[email protected]>
>> Subject: Re: Table goes offline - temporary outage + Retries Exhausted (related?)
>> To: [email protected]
>> Date: Thursday, July 29, 2010, 3:19 PM
>> To follow up on the retry error
>> (still have no idea about the table going offline):
>>
>> It was a coding error, sorta kinda.
>>
>> I was doing large batches with AutoFlush disabled and flushing at
>> the end, figuring I could gain performance and just reprocess bad
>> batches.
>>
>> Bad call.
>>
>> It appears I was consistently getting errors on flush, so the
>> batch just kept failing. Now I flush after every successful file
>> upload; only one or two out of a couple thousand fail, and not
>> consistently on the same file, so retries are possible.
>>
>> I also added a 3-second sleep when I get some kind of IOException
>> executing a Put on this particular table, to prevent some sort of
>> cascade effect.
>>
>> That part is going pretty smoothly now.
>>
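>> In case it helps anyone, the pattern I ended up with looks roughly
>> like this sketch (0.20-era Java API; the 'data' qualifier, row key
>> handling, and retry count are made up for illustration - 'content'
>> is the real family from my table):
>>
>> import java.io.IOException;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.hbase.client.HTable;
>> import org.apache.hadoop.hbase.client.Put;
>> import org.apache.hadoop.hbase.util.Bytes;
>>
>> public class FileUploader {
>>   private final HTable table;
>>
>>   public FileUploader() throws IOException {
>>     HBaseConfiguration conf = new HBaseConfiguration();
>>     table = new HTable(conf, "filestore");
>>     table.setAutoFlush(false);  // buffer puts client-side
>>   }
>>
>>   // Flush after every file instead of at the end of a big batch,
>>   // and back off with a short sleep when a flush fails.
>>   public void uploadFile(byte[] rowKey, byte[] fileBytes)
>>       throws IOException, InterruptedException {
>>     for (int attempt = 1; ; attempt++) {
>>       try {
>>         Put put = new Put(rowKey);
>>         put.add(Bytes.toBytes("content"), Bytes.toBytes("data"), fileBytes);
>>         table.put(put);
>>         table.flushCommits();  // one flush per file
>>         return;
>>       } catch (IOException e) {
>>         if (attempt >= 3) throw e;  // give up; reprocess this file later
>>         Thread.sleep(3000);  // 3-second sleep to avoid a cascade
>>       }
>>     }
>>   }
>> }
>>
>> Flushing per file keeps one bad put from poisoning a whole batch,
>> at some cost in throughput.
>>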
>> Still don't know about the offline table thing - crossing
>> my fingers and watching closely for now (and adding nodes).
>>
>> I guess the moral of the first lesson is to really treat Puts as
>> somewhat unreliable?
>>
>> Take care,
>>   -stu
>>
>>
>>
>> --- On Thu, 7/29/10, Stuart Smith <[email protected]> wrote:
>>
>> > From: Stuart Smith <[email protected]>
>> > Subject: Table goes offline - temporary outage + Retries Exhausted (related?)
>> > To: [email protected]
>> > Date: Thursday, July 29, 2010, 2:09 PM
>> > Hello,
>> >    I have two problems that may or may not be related.
>> >
>> > One is trying to figure out a self-correcting outage I had last
>> > evening.
>> >
>> > I noticed issues starting with clients reporting:
>> >
>> > RetriesExhaustedException: Trying to contact region server Some
>> > server...
>> >
>> > I didn't see much going on in the regionserver logs, except for
>> > some major compactions. Eventually I decided to check the status
>> > of the table being written to, and it was disabled - and not by
>> > me (AFAIK).
>> >
>> > I tried enabling the table via the hbase shell... and it was
>> > taking a long time, so I left for the evening. I came back this
>> > morning, and the shell had reported:
>> >
>> > hbase(main):002:0> enable 'filestore'
>> > NativeException: java.io.IOException: Unable to enable table filestore
>> >
>> > Except by now, the table was back up!
>> >
>> > After going through the logs a little more closely, the only
>> > thing I can find that seems correlated (at least by the timing):
>> >
>> > (in the namenode logs)
>> >
>> > 2010-07-28 18:39:17,213 INFO org.apache.hadoop.hbase.master.ServerManager:
>> > Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS:
>> > filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1279721711873:
>> > Daughters;
>> > filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1280367555232,
>> > filestore,40d8647ad2222e18901071d36124fa3f310970776028ecec7a94d57df10dba86,1280367555232
>> > from ubuntu-hadoop-3,60020,1280263369525; 1 of 1
>> > ...
>> >
>> > 2010-07-28 18:42:45,835 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
>> > filestore,fa8a881cb23eb9e41b197d305275440453e9967e4e6cf53024d478ab984f3392,1280347781550/1176636191
>> > no longer has references to
>> > filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171
>> > 2010-07-28 18:42:45,842 INFO org.apache.hadoop.hbase.master.BaseScanner:
>> > Deleting region
>> > filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171
>> > (encoded=1245524105) because daughter splits no longer hold references
>> > ...
>> > 2010-07-28 18:59:39,000 DEBUG org.apache.hadoop.hbase.master.ChangeTableState:
>> > Processing unserved regions
>> > 2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState:
>> > Skipping region REGION => {NAME =>
>> > 'filestore,201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e,1279613059169',
>> > STARTKEY => '201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e',
>> > ENDKEY => '202ad98d24575f6782c9e9836834b77e0f5ddac0b1efa3cd21ac590482edf3e1',
>> > ENCODED => 1808201339, OFFLINE => true, SPLIT => true,
>> > TABLE => {{NAME => 'filestore', FAMILIES => [{NAME => 'content',
>> > COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647',
>> > BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
>> > because it is offline and split
>> > ...
>> > 2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState:
>> > Processing regions currently being served
>> > 2010-07-28 18:59:39,002 DEBUG org.apache.hadoop.hbase.master.ChangeTableState:
>> > Already online
>> >
>> > ...
>> > 2010-07-28 19:00:34,485 INFO org.apache.hadoop.hbase.master.ServerManager:
>> > 4 region servers, 0 dead, average load 1060.0
>> > 2010-07-28 19:00:49,850 INFO org.apache.hadoop.hbase.master.BaseScanner:
>> > RegionManager.rootScanner scanning meta region {server:
>> > 192.168.193.67:60020, regionname: -ROOT-,,0, startKey: <>}
>> > 2010-07-28 19:00:49,858 INFO org.apache.hadoop.hbase.master.BaseScanner:
>> > RegionManager.rootScanner scan of 1 row(s) of meta region {server:
>> > 192.168.193.67:60020, regionname: -ROOT-,,0, startKey: <>} complete
>> > 2010-07-28 19:01:06,981 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
>> > filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1280348069422/1455931173
>> > no longer has references to
>> > filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1279713176956
>> > ...
>> > I'm not really sure, but I saw these messages toward the end:
>> > ...
>> > 2010-07-28 19:18:31,029 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
>> > filestore,6541cf3f415214d56b0b385d11516b97e18fa2b25da141b770a8a9e0bfe60b52,1280359412067/1522934061
>> > no longer has references to
>> > filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326
>> > 2010-07-28 19:18:31,061 INFO org.apache.hadoop.hbase.master.BaseScanner:
>> > Deleting region
>> > filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326
>> > (encoded=597566178) because daughter splits no longer hold references
>> > 2010-07-28 19:18:31,061 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
>> > DELETING region hdfs://ubuntu-namenode:54310/hbase/filestore/597566178
>> > ...
>> > Which may correspond to the time when it was recovering (if so,
>> > I just missed it coming back online).
>> > ...
>> >
>> > As a final note, I re-ran some of the clients today, and it
>> > appears some are OK, and some consistently give:
>> >
>> > Error: io exception when loading file:
>> > /tmp/archive_transfer/AVT100727-0/803de9924dc8f2d6.bi
>> > org.apache.hadoop.hbase.client.RetriesExhaustedException:
>> > Trying to contact region server Some server, retryOnlyOne=true,
>> > index=0, islastrow=false, tries=9, numtries=10, i=3, listsize=7,
>> > region=filestore,d00ca2d087bdeeb4ee57225a41e19de9dd07e4d9b03be99298046644f9c9e354,1279599904220
>> > for region
>> > filestore,bdfa9f2173033330cfae81ece08f75f0002bf3f3a54cde6bbf9192f0187e275b,1279604506836,
>> > row 'be29a0028bab2149a6c4f990e99c4e7c1c5be0656594738bbe87e7bf0fcda57f',
>> > but failed after 10 attempts
>> >
>> > So while the above is the error that brought the offline table
>> > to my attention - it may just be a separate bug?
>> >
>> > Not sure what causes it, but since it happens consistently in a
>> > program being run with one set of arguments, but not another, I'm
>> > thinking it's an error on my part.
>> >
>> > Any ideas on what could cause the table to go offline?
>> > Any common mistakes that lead to RetriesExhausted errors?
>> >
>> > The retry errors occurred in a shared method that uploads a file
>> > to the filestore, so I'm not sure what causes it to fail in one
>> > case but not another. Maybe just the size of the file? (@300K).
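>> >
>> > If size is the issue, one knob I might try is the client write
>> > buffer - with AutoFlush off, puts accumulate up to
>> > hbase.client.write.buffer (2 MB by default, I believe) before
>> > being sent. Something like:
>> >
>> > <property>
>> >   <name>hbase.client.write.buffer</name>
>> >   <value>8388608</value> <!-- 8 MB; illustrative value -->
>> > </property>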
>> >
>> > Thanks!
>> >
>> > Take care,
>> >   -stu