Hello Ryan,

  Thanks! 

Just to verify - my xceiver count is 4K, my ulimit reports 64000, my datanode 
handler count is 15, my socket write timeout is zero, my swappiness is 1 on the 
datanodes and 0 on the namenode, and memory has been sized per machine - Hadoop 
and HBase each get 3GB on the 8GB-RAM datanodes, leaving 2GB free. The namenode 
has 16GB, split 6GB/6GB.
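
For the record, here's roughly how that looks in my configs. Treat it as a 
sketch: the property names are the standard 0.20-era Hadoop ones, and the user 
name in limits.conf is just whatever account runs the daemons on your boxes.

<!-- hdfs-site.xml on the datanodes -->
<property>
  <name>dfs.datanode.max.xcievers</name>  <!-- yes, upstream spells it that way -->
  <value>4096</value>
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>15</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>

# /etc/security/limits.conf - open-file limit for the hadoop/hbase user
hadoop  -  nofile  64000

# sysctl (vm.swappiness): 1 on the datanodes, 0 on the namenode
vm.swappiness = 1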

After my last round of issues I went through the FAQ & a bunch of blogs - some 
of which were yours, I think - so thanks again :)

I get:

Warning: failed to flush data to sample store: Trying to contact region server 
Some server, retryOnlyOne=true, index=0, islastrow=false, tries=9, numtries=10, 
i=9, listsize=13, 
region=filestore,be4c6d071635b80ac649b7900167f6ddd7cc2dca3578ce8bc24fca523930e81c,1279956247376
 for region 
filestore,bdfa9f2173033330cfae81ece08f75f0002bf3f3a54cde6bbf9192f0187e275b,1279604506836,
 row 'be113824be800baddf62c27ac9cf12a57955a3582d7d8f53541017416cf18ed1', but 
failed after 10 attempts. 

on about 20 files out of 6000 - so right now I just redo the batch and skip 
existing entries (rough sketch below), which works for now.
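
The skip-existing check is nothing fancy - roughly this, against the 0.20 
client API (the method name and the 'data' qualifier are made up; 'content' is 
the real family):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Only upload if the row (keyed by the file hash, in my case) isn't already there.
boolean uploadIfMissing(HTable table, byte[] rowKey, byte[] fileBytes) throws IOException {
    Get get = new Get(rowKey);
    get.addFamily(Bytes.toBytes("content"));   // existence check only
    if (!table.get(get).isEmpty()) {
        return false;                          // made it in on an earlier run - skip
    }
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("content"), Bytes.toBytes("data"), fileBytes);
    table.put(put);
    return true;
}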

What I still need to do is come up with a fuller set of Java snippets that 
illustrate the actual code, and re-post. But that might not happen right away.

My app is a multi-threaded thing with thread pools whose threads have thread 
pools of their own; it FTPs in archives, extracts them, checks for dupes, does 
other stuff, and uploads the files. Which is one reason I think it might be a 
client-side thing ~ but I did wrap my puts with synchronized( table ) {} ;)
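
Concretely, the put path looks something like this - a minimal sketch, not the 
real code (the method and qualifier names are invented):

// One HTable instance shared by the worker threads; HTable isn't thread-safe,
// so puts and flushes are serialized on the table object.
private final HTable table;   // opened once against 'filestore'

void storeFile(byte[] rowKey, byte[] fileBytes) throws IOException {
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("content"), Bytes.toBytes("data"), fileBytes);
    synchronized (table) {
        table.put(put);
        table.flushCommits();   // flush per file (see the earlier thread)
    }
}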

And, yes, for all the tweaking I've had to do on HBase ~ it sure beats the time 
I needed to alter an InnoDB table with about 800 million rows of blobs & 
stuff... that took about a week.

Take care,
  -stu



--- On Thu, 7/29/10, Ryan Rawson <[email protected]> wrote:

> From: Ryan Rawson <[email protected]>
> Subject: Re: Table goes offline - temporary outage + Retries Exhausted  
> (related?)
> To: [email protected]
> Date: Thursday, July 29, 2010, 6:40 PM
> Hi,
> 
> There is a lot going on in this email, the logs might look promising but
> they are standard split messages, not really indicative of anything going
> wrong.
> 
> It sounds like you might be coming across some of the standard foils that
> are well documented in here:
> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#overview_description
> 
> Perhaps you could confirm you have things like xceiver count, and ulimits
> set? I personally use this on all my clusters, maybe you can try it again:
> <property>
> <name>dfs.datanode.socket.write.timeout</name>
> <value>0</value>
> </property>
> 
> Lastly, I don't think that Put should be unreliable, I have reliably
> imported 10s of billions of rows, so there is something else going on.
> 
> -ryan
> PS: mysql dbas spend tons of time setting up ulimits and other esoteric
> kernel tuning parameters, our requirement is actually surprisingly low in
> that regard.
> 
> On Thu, Jul 29, 2010 at 3:02 PM, Stuart Smith <[email protected]>
> wrote:
> > Hello all,
> >
> > It looks like I had an ensemble of unrelated errors.
> >
> > To follow up with the table going offline error:
> >
> > I noticed today that the gui will say: "Enabled False", and the shell will say:
> >
> > hbase(main):004:0> describe 'filestore'
> > DESCRIPTION                                                            ENABLED
> >  {NAME => 'filestore', FAMILIES => [{NAME => 'content', COMPRESSION =>  false
> >  'LZO', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_
> >
> > Soo... I'm not sure which is which - maybe it was never disabled, depending
> > on whether the gui or the shell is correct. It appears to be the shell,
> > since I've been uploading more data, and it's going through fine now.
> >
> > I'm guessing yesterday's uploads were failing due to the batch issues, the
> > gui reported the table as disabled, and I connected the two issues
> > incorrectly.
> >
> > Take care,
> >  -stu
> >
> > --- On Thu, 7/29/10, Stuart Smith <[email protected]> wrote:
> >
> >> From: Stuart Smith <[email protected]>
> >> Subject: Re: Table goes offline - temporary outage + Retries Exhausted (related?)
> >> To: [email protected]
> >> Date: Thursday, July 29, 2010, 3:19 PM
> >> To follow up on the retry error (still have no idea about the table going
> >> offline):
> >>
> >> It was a coding error, sorta kinda.
> >>
> >> I was doing large batches with AutoFlush disabled, and flushing at the
> >> end, figuring I could gain performance and just reprocess bad batches.
> >>
> >> Bad call.
> >>
> >> It appears I was consistently getting errors on flush, so the batch just
> >> kept failing. Now I flush after every successful file upload, and only one
> >> or two out of a couple thousand fail, and not consistently on one file, so
> >> retries are possible.
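> >>
> >> Roughly, the change was something like this (just a sketch - 'FileEntry',
> >> 'batch', and 'makePut' are made-up names for illustration):
> >>
> >> // before: one big batch with a single flush at the end, so one bad
> >> // flush poisoned the whole batch
> >> table.setAutoFlush(false);
> >> for (FileEntry f : batch) {
> >>     table.put(makePut(f));
> >> }
> >> table.flushCommits();
> >>
> >> // after: flush per successful file, so a failure only costs that one file
> >> table.setAutoFlush(false);
> >> for (FileEntry f : batch) {
> >>     table.put(makePut(f));
> >>     table.flushCommits();
> >> }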
> >>
> >> I also added a 3 second sleep when I get some kind of IOException
> >> executing a Put on this particular table, to prevent some sort of cascade
> >> effect.
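> >>
> >> Something like this (sketch only; the immediate retry is illustrative and
> >> InterruptedException handling is elided):
> >>
> >> try {
> >>     table.put(put);
> >>     table.flushCommits();
> >> } catch (IOException e) {
> >>     Thread.sleep(3000);   // 3 second back-off so one hiccup doesn't cascade
> >>     table.put(put);       // then retry once (or re-queue the file)
> >>     table.flushCommits();
> >> }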
> >>
> >> That part is going pretty smooth now.
> >>
> >> Still don't know about the offline table thing - crossing my fingers and
> >> watching closely for now (and adding nodes).
> >>
> >> I guess the moral of the first lesson is to really treat the Puts() as
> >> somewhat unreliable?
> >>
> >> Take care,
> >>   -stu
> >>
> >>
> >>
> >> --- On Thu, 7/29/10, Stuart Smith <[email protected]> wrote:
> >>
> >> > From: Stuart Smith <[email protected]>
> >> > Subject: Table goes offline - temporary outage + Retries Exhausted (related?)
> >> > To: [email protected]
> >> > Date: Thursday, July 29, 2010, 2:09 PM
> >> > Hello,
> >> >    I have two problems that may or may not be related.
> >> >
> >> > One is trying to figure out a self-correcting outage I had last evening.
> >> >
> >> > I noticed issues starting with clients reporting:
> >> >
> >> > RetriesExhaustedException: Trying to contact region server Some server...
> >> >
> >> > I didn't see much going on in the regionserver logs, except for some
> >> > major compactions. Eventually I decided to check the status of the table
> >> > being written to, and it was disabled - and not by me (AFAIK).
> >> >
> >> > I tried enabling the table via the hbase shell.. and it was taking a
> >> > long time, so I left for the evening. I came back this morning, and the
> >> > shell had reported:
> >> >
> >> > hbase(main):002:0> enable 'filestore'
> >> > NativeException: java.io.IOException: Unable to enable table filestore
> >> >
> >> > Except by now, the table was back up!
> >> >
> >> > After going through the logs a little more closely, the only thing I can
> >> > find that seems correlated (at least by the timing):
> >> >
> >> > (in the namenode logs)
> >> >
> >> > 2010-07-28 18:39:17,213 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1279721711873: Daughters; filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1280367555232, filestore,40d8647ad2222e18901071d36124fa3f310970776028ecec7a94d57df10dba86,1280367555232 from ubuntu-hadoop-3,60020,1280263369525; 1 of 1
> >> >
> >> > ...
> >> >
> >> > 2010-07-28 18:42:45,835 DEBUG org.apache.hadoop.hbase.master.BaseScanner: filestore,fa8a881cb23eb9e41b197d305275440453e9967e4e6cf53024d478ab984f3392,1280347781550/1176636191 no longer has references to filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171
> >> > 2010-07-28 18:42:45,842 INFO org.apache.hadoop.hbase.master.BaseScanner: Deleting region filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171 (encoded=1245524105) because daughter splits no longer hold references
> >> > ...
> >> > 2010-07-28 18:59:39,000 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Processing unserved regions
> >> > 2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Skipping region REGION => {NAME => 'filestore,201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e,1279613059169', STARTKEY => '201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e', ENDKEY => '202ad98d24575f6782c9e9836834b77e0f5ddac0b1efa3cd21ac590482edf3e1', ENCODED => 1808201339, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'filestore', FAMILIES => [{NAME => 'content', COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} because it is offline and split
> >> > ...
> >> > 2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Processing regions currently being served
> >> > 2010-07-28 18:59:39,002 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Already online
> >> >
> >> > ...
> >> > 2010-07-28 19:00:34,485 INFO org.apache.hadoop.hbase.master.ServerManager: 4 region servers, 0 dead, average load 1060.0
> >> > 2010-07-28 19:00:49,850 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.193.67:60020, regionname: -ROOT-,,0, startKey: <>}
> >> > 2010-07-28 19:00:49,858 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.193.67:60020, regionname: -ROOT-,,0, startKey: <>} complete
> >> > 2010-07-28 19:01:06,981 DEBUG org.apache.hadoop.hbase.master.BaseScanner: filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1280348069422/1455931173 no longer has references to filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1279713176956
> >> > ...
> >> > I'm not really sure, but I saw these messages toward the end:
> >> > ...
> >> > 2010-07-28 19:18:31,029 DEBUG org.apache.hadoop.hbase.master.BaseScanner: filestore,6541cf3f415214d56b0b385d11516b97e18fa2b25da141b770a8a9e0bfe60b52,1280359412067/1522934061 no longer has references to filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326
> >> > 2010-07-28 19:18:31,061 INFO org.apache.hadoop.hbase.master.BaseScanner: Deleting region filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326 (encoded=597566178) because daughter splits no longer hold references
> >> > 2010-07-28 19:18:31,061 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: DELETING region hdfs://ubuntu-namenode:54310/hbase/filestore/597566178
> >> > ...
> >> > Which may correspond to the time when it was recovering (if so, I just
> >> > missed it coming back online).
> >> > ...
> >> >
> >> > As a final note, I re-ran some of the clients today, and it appears some
> >> > are OK, and some consistently give:
> >> >
> >> > Error: io exception when loading file: /tmp/archive_transfer/AVT100727-0/803de9924dc8f2d6.bi
> >> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=false, tries=9, numtries=10, i=3, listsize=7, region=filestore,d00ca2d087bdeeb4ee57225a41e19de9dd07e4d9b03be99298046644f9c9e354,1279599904220 for region filestore,bdfa9f2173033330cfae81ece08f75f0002bf3f3a54cde6bbf9192f0187e275b,1279604506836, row 'be29a0028bab2149a6c4f990e99c4e7c1c5be0656594738bbe87e7bf0fcda57f', but failed after 10 attempts
> >> >
> >> > So while the above is the error that brought the offline table to my
> >> > attention - it may just be a separate bug?
> >> >
> >> > Not sure what causes it, but since it happens consistently in a program
> >> > being run with one set of arguments, but not another, I'm thinking it's
> >> > an error on my part.
> >> >
> >> > Any ideas on what could cause the table to go offline? Any common
> >> > mistakes that lead to RetriesExhausted errors?
> >> >
> >> > The Retry errors occurred in a shared method that uploads a file to the
> >> > filestore, so I'm not sure what causes it to fail in one case, but not
> >> > another. Maybe just the size of the file? (@300K).
> >> >
> >> > Thanks!
> >> >
> >> > Take care,
> >> >   -stu


