Hello Ryan, I'll get the logs together tomorrow.
I had to spend the last couple of hours getting a static IP block and bringing the cluster up and down. Thanks for the help & enthusiasm!

Take care,
-stu

--- On Thu, 7/29/10, Ryan Rawson <[email protected]> wrote:

> From: Ryan Rawson <[email protected]>
> Subject: Re: Table goes offline - temporary outage + Retries Exhausted (related?)
> To: [email protected]
> Date: Thursday, July 29, 2010, 7:36 PM
>
> There is some root cause behind the 'failed to flush' message... I'd like to get to the root of that. Unfortunately it means lots of log groveling. If you want to post logs, try pastebin.com instead of trying to attach files.
>
> Dig some dirt up and let's check it out :-)
>
> -ryan
>
> On Thu, Jul 29, 2010 at 4:25 PM, Stuart Smith <[email protected]> wrote:
> > Hello Ryan,
> >
> > Thanks!
> >
> > Just to verify - my xceiver count is 4K, my ulimit reports 64000, my datanode handler count is 15, my socket write timeout is zero, my swappiness is 1 on the datanodes and 0 on the namenode, and my memory has been tweaked according to the machines - hadoop and hbase both get 3GB on the 8GB RAM datanodes, leaving 2GB free. The namenode has 16GB and is split 6GB/6GB.
> >
> > After my last round of issues I went through the FAQ & a bunch of blogs - of which some were yours, I think - so thanks again :)
> >
> > I get
> >
> > Warning: failed to flush data to sample store: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=false, tries=9, numtries=10, i=9, listsize=13, region=filestore,be4c6d071635b80ac649b7900167f6ddd7cc2dca3578ce8bc24fca523930e81c,1279956247376 for region filestore,bdfa9f2173033330cfae81ece08f75f0002bf3f3a54cde6bbf9192f0187e275b,1279604506836, row 'be113824be800baddf62c27ac9cf12a57955a3582d7d8f53541017416cf18ed1', but failed after 10 attempts.
> >
> > on about 20 files out of 6000 - so right now I just redo the batch and skip existing entries, which works for now.
> >
> > What I think I need to do is come up with a nice set of java snippets that illustrate my code, and re-post. But that might not happen right away.
> >
> > My app is this multi-threaded thingy that has thread pools with threads that have thread pools, ftps in archives, extracts them, checks for dupes, does other stuff, and uploads files. Which is one reason I think it might be a client-side thing ~ but I did wrap my puts with synchronized( table ) {} ;)
> >
> > And, yes, for all the tweaking I've had to do on Hbase ~ it sure beats the time I needed to alter an innodb table with about 800 million rows of blobs & stuff... that took about a week.
> >
> > Take care,
> > -stu
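(For reference, a minimal sketch of the shared-table pattern described above: one HTable used by many worker threads, with each put wrapped in synchronized(table) since HTable is not safe for concurrent use. The class name, qualifier, and row-key choice are illustrative assumptions, not code from the thread; only the 'filestore' table and 'content' family come from it.)

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FileStoreUploader {
        // One HTable shared by all worker threads; every put is funneled
        // through synchronized(table) because HTable is not thread-safe.
        private final HTable table;

        public FileStoreUploader() throws IOException {
            table = new HTable(new HBaseConfiguration(), "filestore");
        }

        // Called concurrently by the extraction/upload worker threads.
        public void upload(String sha256, byte[] fileBytes) throws IOException {
            Put put = new Put(Bytes.toBytes(sha256));
            // 'raw' qualifier is a placeholder; only the 'content' family is real.
            put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"), fileBytes);
            synchronized (table) {
                table.put(put);
            }
        }
    }

(The usual alternative to synchronizing on one shared HTable is to give each worker thread its own HTable instance on the same table.)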
> >
> > --- On Thu, 7/29/10, Ryan Rawson <[email protected]> wrote:
> >
> >> From: Ryan Rawson <[email protected]>
> >> Subject: Re: Table goes offline - temporary outage + Retries Exhausted (related?)
> >> To: [email protected]
> >> Date: Thursday, July 29, 2010, 6:40 PM
> >>
> >> Hi,
> >>
> >> There is a lot going on in this email. The logs might look promising, but they are standard split messages, not really indicative of anything going wrong.
> >>
> >> It sounds like you might be coming across some of the standard foils that are well documented here:
> >> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#overview_description
> >>
> >> Perhaps you could confirm you have things like the xceiver count and ulimits set? I personally use this on all my clusters, maybe you can try it again:
> >>
> >> <property>
> >>   <name>dfs.datanode.socket.write.timeout</name>
> >>   <value>0</value>
> >> </property>
> >>
> >> Lastly, I don't think that Put should be unreliable. I have reliably imported 10s of billions of rows, so there is something else going on.
> >>
> >> -ryan
> >>
> >> PS: mysql DBAs spend tons of time setting up ulimits and other esoteric kernel tuning parameters; our requirement is actually surprisingly low in that regard.
> >>
> >> On Thu, Jul 29, 2010 at 3:02 PM, Stuart Smith <[email protected]> wrote:
> >> > Hello all,
> >> >
> >> > It looks like I had an ensemble of unrelated errors.
> >> >
> >> > To follow up on the table going offline:
> >> >
> >> > I noticed today that the GUI will say "Enabled: False", while the shell will say:
> >> >
> >> > hbase(main):004:0> describe 'filestore'
> >> > DESCRIPTION                                                          ENABLED
> >> >  {NAME => 'filestore', FAMILIES => [{NAME => 'content', COMPRESSION =>  false
> >> >  'LZO', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_
> >> >
> >> > Soo... I'm not sure which is which - maybe it was never disabled, depending on whether the GUI or the shell is correct. It appears to be the shell, since I've been uploading more data, and it's going through fine now.
> >> >
> >> > I'm guessing yesterday's uploads were failing due to the batch issues, the GUI reported the table as disabled, and I connected the two issues incorrectly.
> >> >
> >> > Take care,
> >> > -stu
> >> >
> >> > --- On Thu, 7/29/10, Stuart Smith <[email protected]> wrote:
> >> >
> >> >> From: Stuart Smith <[email protected]>
> >> >> Subject: Re: Table goes offline - temporary outage + Retries Exhausted (related?)
> >> >> To: [email protected]
> >> >> Date: Thursday, July 29, 2010, 3:19 PM
> >> >>
> >> >> To follow up on the retry error (still have no idea about the table going offline):
> >> >>
> >> >> It was a coding error, sorta kinda.
> >> >>
> >> >> I was doing large batches with AutoFlush disabled, and flushing at the end, figuring I could gain performance and just reprocess bad batches.
> >> >>
> >> >> Bad call.
> >> >>
> >> >> It appears I was consistently getting errors on flush, so the batch just kept failing. Now I flush after every successful file upload, and only one or two out of a couple thousand fail, and not consistently on one file, so retries are possible.
> >> >>
> >> >> I also added a 3 second sleep when I get some kind of IOException executing a Put on this particular table, to prevent some sort of cascade effect.
> >> >>
> >> >> That part is going pretty smoothly now.
> >> >>
> >> >> Still don't know about the offline table thing - crossing my fingers and watching closely for now (and adding nodes).
> >> >>
> >> >> I guess the moral of the first lesson is to really treat the Puts() as somewhat unreliable?
> >> >>
> >> >> Take care,
> >> >> -stu
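(A rough sketch of that flush-per-file flow, assuming the same 'filestore'/'content' layout; the method name, 'raw' qualifier, and retry budget are placeholders rather than anything from the thread.)

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FlushPerFile {
        // AutoFlush stays off for buffering, but flushCommits() runs after
        // each file instead of once per batch, so a failure only costs one
        // file. On IOException, sleep 3 seconds before retrying that file.
        static void uploadOneFile(HTable table, String rowKey, byte[] fileBytes)
                throws IOException, InterruptedException {
            table.setAutoFlush(false);
            final int maxAttempts = 3;   // retry budget is a guess, not from the thread
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    Put put = new Put(Bytes.toBytes(rowKey));
                    put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"), fileBytes);
                    table.put(put);
                    table.flushCommits();   // flush after every successful file
                    return;
                } catch (IOException e) {
                    if (attempt == maxAttempts) {
                        throw e;             // give up; the caller can redo this file
                    }
                    Thread.sleep(3000);      // 3 second pause to avoid a cascade
                }
            }
        }
    }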
> >> >> > To: [email protected] > >> >> > Date: Thursday, July 29, 2010, 2:09 > PM > >> >> > Hello, > >> >> > I have two problems that may > or may > >> not > >> >> > be related. > >> >> > > >> >> > One is trying to figure out a > self-correcting > >> outage I > >> >> had > >> >> > last evening. > >> >> > > >> >> > I noticed issues starting with > clients > >> reporting: > >> >> > > >> >> > RetriesExhaustedException: Trying to > contact > >> region > >> >> server > >> >> > Some server... > >> >> > > >> >> > I didn't see much going on in the > >> regionserver logs, > >> >> except > >> >> > for some major compactions. > Eventually I > >> decided to > >> >> check > >> >> > the status of the table being > written to, and > >> it was > >> >> > disabled - and not by me (AFAIK). > >> >> > > >> >> > I tried enabling the table via the > hbase > >> shell.. and > >> >> it was > >> >> > taking a long time, so I left for > the > >> evening. I > >> >> came > >> >> > back this morning, and the shell > had > >> reported: > >> >> > > >> >> > hbase(main):002:0> enable > 'filestore' > >> >> > NativeException: > java.io.IOException: Unable > >> to > >> >> enable > >> >> > table filestore > >> >> > > >> >> > Except by now, the table was back > up! > >> >> > > >> >> > After going through the logs a > little more > >> closely, > >> >> the > >> >> > only thing I can find that seems > correlated > >> (at least > >> >> by the > >> >> > timing): > >> >> > > >> >> > (in the namenode logs) > >> >> > > >> >> > 2010-07-28 18:39:17,213 INFO > >> >> > > >> org.apache.hadoop.hbase.master.ServerManager: > >> >> Processing > >> >> > > MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: > >> >> > > >> >> > >> > filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1279721711873: > >> >> > Daughters; > >> >> > > >> >> > >> > filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1280367555232, > >> >> > > >> >> > >> > filestore,40d8647ad2222e18901071d36124fa3f310970776028ecec7a94d57df10dba86,1280367555232 > >> >> > from > ubuntu-hadoop-3,60020,1280263369525; 1 > >> of 1 > >> >> > > >> >> > ... > >> >> > > >> >> > 010-07-28 18:42:45,835 DEBUG > >> >> > > org.apache.hadoop.hbase.master.BaseScanner: > >> >> > > >> >> > >> > filestore,fa8a881cb23eb9e41b197d305275440453e9967e4e6cf53024d478ab984f3392,1280347781550/1176636191 > >> >> > no longer has references to > >> >> > > >> >> > >> > filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171 > >> >> > 2010-07-28 18:42:45,842 INFO > >> >> > > org.apache.hadoop.hbase.master.BaseScanner: > >> Deleting > >> >> region > >> >> > > >> >> > >> > filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171 > >> >> > (encoded=1245524105) because > daughter splits > >> no longer > >> >> hold > >> >> > references > >> >> > ... 
> >> >> > 2010-07-28 18:59:39,000 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Processing unserved regions
> >> >> > 2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Skipping region REGION => {NAME => 'filestore,201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e,1279613059169', STARTKEY => '201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e', ENDKEY => '202ad98d24575f6782c9e9836834b77e0f5ddac0b1efa3cd21ac590482edf3e1', ENCODED => 1808201339, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'filestore', FAMILIES => [{NAME => 'content', COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} because it is offline and split
> >> >> > ...
> >> >> > 2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Processing regions currently being served
> >> >> > 2010-07-28 18:59:39,002 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Already online
> >> >> > ...
> >> >> > 2010-07-28 19:00:34,485 INFO org.apache.hadoop.hbase.master.ServerManager: 4 region servers, 0 dead, average load 1060.0
> >> >> > 2010-07-28 19:00:49,850 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.193.67:60020, regionname: -ROOT-,,0, startKey: <>}
> >> >> > 2010-07-28 19:00:49,858 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.193.67:60020, regionname: -ROOT-,,0, startKey: <>} complete
> >> >> > 2010-07-28 19:01:06,981 DEBUG org.apache.hadoop.hbase.master.BaseScanner: filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1280348069422/1455931173 no longer has references to filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1279713176956
> >> >> > ...
> >> >> >
> >> >> > I'm not really sure, but I saw these messages toward the end:
> >> >> >
> >> >> > 2010-07-28 19:18:31,029 DEBUG org.apache.hadoop.hbase.master.BaseScanner: filestore,6541cf3f415214d56b0b385d11516b97e18fa2b25da141b770a8a9e0bfe60b52,1280359412067/1522934061 no longer has references to filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326
> >> >> > 2010-07-28 19:18:31,061 INFO org.apache.hadoop.hbase.master.BaseScanner: Deleting region filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326 (encoded=597566178) because daughter splits no longer hold references
> >> >> > 2010-07-28 19:18:31,061 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: DELETING region hdfs://ubuntu-namenode:54310/hbase/filestore/597566178
> >> >> > ...
> >> >> >
> >> >> > Which may correspond to the time when it was recovering (if so, I just missed it coming back online).
> >> >> >
> >> >> > As a final note, I re-ran some of the clients today, and it appears some are OK, and some consistently give:
> >> >> >
> >> >> > Error: io exception when loading file: /tmp/archive_transfer/AVT100727-0/803de9924dc8f2d6.bi
> >> >> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=false, tries=9, numtries=10, i=3, listsize=7, region=filestore,d00ca2d087bdeeb4ee57225a41e19de9dd07e4d9b03be99298046644f9c9e354,1279599904220 for region filestore,bdfa9f2173033330cfae81ece08f75f0002bf3f3a54cde6bbf9192f0187e275b,1279604506836, row 'be29a0028bab2149a6c4f990e99c4e7c1c5be0656594738bbe87e7bf0fcda57f', but failed after 10 attempts
> >> >> >
> >> >> > So while the above is the error that brought the offline table to my attention - it may just be a separate bug?
> >> >> >
> >> >> > Not sure what causes it, but since it happens consistently in a program being run with one set of arguments, but not another, I'm thinking it's an error on my part.
> >> >> >
> >> >> > Any ideas on what could cause the table to go offline? Any common mistakes that lead to RetriesExhausted errors?
> >> >> >
> >> >> > The Retry errors occurred in a shared method that uploads a file to the filestore, so I'm not sure what causes it to fail in one case but not another. Maybe just the size of the file? (@300K).
> >> >> >
> >> >> > Thanks!
> >> >> >
> >> >> > Take care,
> >> >> > -stu
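(A general note on the retry messages above: the numtries=10 in those errors lines up with the client-side hbase.client.retries.number setting, which defaults to 10, and the wait between attempts comes from hbase.client.pause. Below is a minimal sketch of raising both on the client while a region is briefly unavailable, e.g. mid-split; the specific values are only examples, not a recommendation from the thread.)

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class RetryTunedClient {
        // Give a briefly-unavailable region more time before the client
        // gives up with RetriesExhaustedException.
        static HTable openFilestore() throws IOException {
            HBaseConfiguration conf = new HBaseConfiguration();
            conf.setInt("hbase.client.retries.number", 20);  // default is 10
            conf.setLong("hbase.client.pause", 2000);        // ms between attempts
            return new HTable(conf, "filestore");
        }
    }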
