Hello Ryan, I'll get the logs together tomorrow.
I had to spend the last couple of hours getting a static IP block and bringing the cluster up and down. Thanks for the help & enthusiasm!

Take care,
-stu

--- On Thu, 7/29/10, Ryan Rawson <[email protected]> wrote:

> From: Ryan Rawson <[email protected]>
> Subject: Re: Table goes offline - temporary outage + Retries Exhausted (related?)
> To: [email protected]
> Date: Thursday, July 29, 2010, 7:36 PM
>
> There is some root cause behind the 'failed to flush' message... I'd like to get to the root of that. Unfortunately it means lots of log groveling. If you want to post logs, try pastebin.com instead of trying to attach files.
>
> Dig some dirt up and let's check it out :-)
>
> -ryan
>
> On Thu, Jul 29, 2010 at 4:25 PM, Stuart Smith <[email protected]> wrote:
> > Hello Ryan,
> >
> > Thanks!
> >
> > Just to verify - my xceiver count is 4K, my ulimit reports 64000, my datanode handler count is 15, my socket write timeout is zero, my swappiness is 1 on the datanodes and 0 on the namenode, and my memory has been tweaked according to the machines - hadoop and hbase both get 3GB on the 8GB RAM datanodes, leaving 2GB free. The namenode has 16GB and is split 6GB/6GB.
> >
> > After my last round of issues I went through the FAQ & a bunch of blogs - of which some were yours, I think - so thanks again :)
> >
> > I get
> >
> > Warning: failed to flush data to sample store: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=false, tries=9, numtries=10, i=9, listsize=13, region=filestore,be4c6d071635b80ac649b7900167f6ddd7cc2dca3578ce8bc24fca523930e81c,1279956247376 for region filestore,bdfa9f2173033330cfae81ece08f75f0002bf3f3a54cde6bbf9192f0187e275b,1279604506836, row 'be113824be800baddf62c27ac9cf12a57955a3582d7d8f53541017416cf18ed1', but failed after 10 attempts.
> >
> > on about 20 files out of 6000 - so right now I just redo the batch and skip existing entries, which works for now.
> >
> > What I think I need to do is come up with a nice set of java snippets that illustrate my code, and re-post. But that might not happen right away.
> >
> > My app is this multi-threaded thingy that has thread pools with threads that have thread pools, ftps in archives, extracts them, checks for dupes, does other stuff, and uploads files. Which is one reason I think it might be a client-side thing ~ but I did wrap my puts with synchronized( table ) {} ;)
> >
> > And, yes, for all the tweaking I've had to do on Hbase ~ it sure beats the time I needed to alter an innodb table with about 800 million rows of blobs & stuff... that took about a week.
> >
> > Take care,
> > -stu
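(For reference, a minimal sketch of the shared-table pattern described above: one HTable used by many worker threads, with each put wrapped in synchronized(table) since HTable is not safe for concurrent use. The class name, qualifier, and row-key choice are illustrative assumptions, not code from the thread; only the 'filestore' table and 'content' family come from it.)

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FileStoreUploader {
        // One HTable shared by all worker threads; every put is funneled
        // through synchronized(table) because HTable is not thread-safe.
        private final HTable table;

        public FileStoreUploader() throws IOException {
            table = new HTable(new HBaseConfiguration(), "filestore");
        }

        // Called concurrently by the extraction/upload worker threads.
        public void upload(String sha256, byte[] fileBytes) throws IOException {
            Put put = new Put(Bytes.toBytes(sha256));
            // 'raw' qualifier is a placeholder; only the 'content' family is real.
            put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"), fileBytes);
            synchronized (table) {
                table.put(put);
            }
        }
    }

(The usual alternative to synchronizing on one shared HTable is to give each worker thread its own HTable instance on the same table.)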
> >
> > --- On Thu, 7/29/10, Ryan Rawson <[email protected]> wrote:
> >
> >> From: Ryan Rawson <[email protected]>
> >> Subject: Re: Table goes offline - temporary outage + Retries Exhausted (related?)
> >> To: [email protected]
> >> Date: Thursday, July 29, 2010, 6:40 PM
> >>
> >> Hi,
> >>
> >> There is a lot going on in this email. The logs might look promising, but they are standard split messages, not really indicative of anything going wrong.
> >>
> >> It sounds like you might be coming across some of the standard foils that are well documented here:
> >> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#overview_description
> >>
> >> Perhaps you could confirm you have things like the xceiver count and ulimits set? I personally use this on all my clusters, maybe you can try it again:
> >>
> >> <property>
> >>   <name>dfs.datanode.socket.write.timeout</name>
> >>   <value>0</value>
> >> </property>
> >>
> >> Lastly, I don't think that Put should be unreliable. I have reliably imported 10s of billions of rows, so there is something else going on.
> >>
> >> -ryan
> >>
> >> PS: mysql DBAs spend tons of time setting up ulimits and other esoteric kernel tuning parameters; our requirement is actually surprisingly low in that regard.
> >>
> >> On Thu, Jul 29, 2010 at 3:02 PM, Stuart Smith <[email protected]> wrote:
> >> > Hello all,
> >> >
> >> > It looks like I had an ensemble of unrelated errors.
> >> >
> >> > To follow up on the table going offline:
> >> >
> >> > I noticed today that the GUI will say "Enabled: False", while the shell will say:
> >> >
> >> > hbase(main):004:0> describe 'filestore'
> >> > DESCRIPTION                                                          ENABLED
> >> >  {NAME => 'filestore', FAMILIES => [{NAME => 'content', COMPRESSION =>  false
> >> >  'LZO', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_
> >> >
> >> > Soo... I'm not sure which is which - maybe it was never disabled, depending on whether the GUI or the shell is correct. It appears to be the shell, since I've been uploading more data, and it's going through fine now.
> >> >
> >> > I'm guessing yesterday's uploads were failing due to the batch issues, the GUI reported the table as disabled, and I connected the two issues incorrectly.
> >> >
> >> > Take care,
> >> > -stu
> >> >
> >> > --- On Thu, 7/29/10, Stuart Smith <[email protected]> wrote:
> >> >
> >> >> From: Stuart Smith <[email protected]>
> >> >> Subject: Re: Table goes offline - temporary outage + Retries Exhausted (related?)
> >> >> To: [email protected]
> >> >> Date: Thursday, July 29, 2010, 3:19 PM
> >> >>
> >> >> To follow up on the retry error (still have no idea about the table going offline):
> >> >>
> >> >> It was a coding error, sorta kinda.
> >> >>
> >> >> I was doing large batches with AutoFlush disabled, and flushing at the end, figuring I could gain performance and just reprocess bad batches.
> >> >>
> >> >> Bad call.
> >> >>
> >> >> It appears I was consistently getting errors on flush, so the batch just kept failing. Now I flush after every successful file upload, and only one or two out of a couple thousand fail, and not consistently on one file, so retries are possible.
> >> >>
> >> >> I also added a 3 second sleep when I get some kind of IOException executing a Put on this particular table, to prevent some sort of cascade effect.
> >> >>
> >> >> That part is going pretty smoothly now.
> >> >>
> >> >> Still don't know about the offline table thing - crossing my fingers and watching closely for now (and adding nodes).
> >> >>
> >> >> I guess the moral of the first lesson is to really treat the Puts() as somewhat unreliable?
> >> >>
> >> >> Take care,
> >> >> -stu
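(A rough sketch of that flush-per-file flow, assuming the same 'filestore'/'content' layout; the method name, 'raw' qualifier, and retry budget are placeholders rather than anything from the thread.)

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FlushPerFile {
        // AutoFlush stays off for buffering, but flushCommits() runs after
        // each file instead of once per batch, so a failure only costs one
        // file. On IOException, sleep 3 seconds before retrying that file.
        static void uploadOneFile(HTable table, String rowKey, byte[] fileBytes)
                throws IOException, InterruptedException {
            table.setAutoFlush(false);
            final int maxAttempts = 3;   // retry budget is a guess, not from the thread
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    Put put = new Put(Bytes.toBytes(rowKey));
                    put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"), fileBytes);
                    table.put(put);
                    table.flushCommits();   // flush after every successful file
                    return;
                } catch (IOException e) {
                    if (attempt == maxAttempts) {
                        throw e;             // give up; the caller can redo this file
                    }
                    Thread.sleep(3000);      // 3 second pause to avoid a cascade
                }
            }
        }
    }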
> >> >> > To: [email protected] > >> >> > Date: Thursday, July 29, 2010, 2:09 > PM > >> >> > Hello, > >> >> > I have two problems that may > or may > >> not > >> >> > be related. > >> >> > > >> >> > One is trying to figure out a > self-correcting > >> outage I > >> >> had > >> >> > last evening. > >> >> > > >> >> > I noticed issues starting with > clients > >> reporting: > >> >> > > >> >> > RetriesExhaustedException: Trying to > contact > >> region > >> >> server > >> >> > Some server... > >> >> > > >> >> > I didn't see much going on in the > >> regionserver logs, > >> >> except > >> >> > for some major compactions. > Eventually I > >> decided to > >> >> check > >> >> > the status of the table being > written to, and > >> it was > >> >> > disabled - and not by me (AFAIK). > >> >> > > >> >> > I tried enabling the table via the > hbase > >> shell.. and > >> >> it was > >> >> > taking a long time, so I left for > the > >> evening. I > >> >> came > >> >> > back this morning, and the shell > had > >> reported: > >> >> > > >> >> > hbase(main):002:0> enable > 'filestore' > >> >> > NativeException: > java.io.IOException: Unable > >> to > >> >> enable > >> >> > table filestore > >> >> > > >> >> > Except by now, the table was back > up! > >> >> > > >> >> > After going through the logs a > little more > >> closely, > >> >> the > >> >> > only thing I can find that seems > correlated > >> (at least > >> >> by the > >> >> > timing): > >> >> > > >> >> > (in the namenode logs) > >> >> > > >> >> > 2010-07-28 18:39:17,213 INFO > >> >> > > >> org.apache.hadoop.hbase.master.ServerManager: > >> >> Processing > >> >> > > MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: > >> >> > > >> >> > >> > filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1279721711873: > >> >> > Daughters; > >> >> > > >> >> > >> > filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1280367555232, > >> >> > > >> >> > >> > filestore,40d8647ad2222e18901071d36124fa3f310970776028ecec7a94d57df10dba86,1280367555232 > >> >> > from > ubuntu-hadoop-3,60020,1280263369525; 1 > >> of 1 > >> >> > > >> >> > ... > >> >> > > >> >> > 010-07-28 18:42:45,835 DEBUG > >> >> > > org.apache.hadoop.hbase.master.BaseScanner: > >> >> > > >> >> > >> > filestore,fa8a881cb23eb9e41b197d305275440453e9967e4e6cf53024d478ab984f3392,1280347781550/1176636191 > >> >> > no longer has references to > >> >> > > >> >> > >> > filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171 > >> >> > 2010-07-28 18:42:45,842 INFO > >> >> > > org.apache.hadoop.hbase.master.BaseScanner: > >> Deleting > >> >> region > >> >> > > >> >> > >> > filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171 > >> >> > (encoded=1245524105) because > daughter splits > >> no longer > >> >> hold > >> >> > references > >> >> > ... 
> >> >> > 2010-07-28 18:59:39,000 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Processing unserved regions
> >> >> > 2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Skipping region REGION => {NAME => 'filestore,201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e,1279613059169', STARTKEY => '201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e', ENDKEY => '202ad98d24575f6782c9e9836834b77e0f5ddac0b1efa3cd21ac590482edf3e1', ENCODED => 1808201339, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'filestore', FAMILIES => [{NAME => 'content', COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} because it is offline and split
> >> >> > ...
> >> >> > 2010-07-28 18:59:39,001 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Processing regions currently being served
> >> >> > 2010-07-28 18:59:39,002 DEBUG org.apache.hadoop.hbase.master.ChangeTableState: Already online
> >> >> > ...
> >> >> > 2010-07-28 19:00:34,485 INFO org.apache.hadoop.hbase.master.ServerManager: 4 region servers, 0 dead, average load 1060.0
> >> >> > 2010-07-28 19:00:49,850 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.193.67:60020, regionname: -ROOT-,,0, startKey: <>}
> >> >> > 2010-07-28 19:00:49,858 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.193.67:60020, regionname: -ROOT-,,0, startKey: <>} complete
> >> >> > 2010-07-28 19:01:06,981 DEBUG org.apache.hadoop.hbase.master.BaseScanner: filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1280348069422/1455931173 no longer has references to filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1279713176956
> >> >> > ...
> >> >> >
> >> >> > I'm not really sure, but I saw these messages toward the end:
> >> >> >
> >> >> > 2010-07-28 19:18:31,029 DEBUG org.apache.hadoop.hbase.master.BaseScanner: filestore,6541cf3f415214d56b0b385d11516b97e18fa2b25da141b770a8a9e0bfe60b52,1280359412067/1522934061 no longer has references to filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326
> >> >> > 2010-07-28 19:18:31,061 INFO org.apache.hadoop.hbase.master.BaseScanner: Deleting region filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326 (encoded=597566178) because daughter splits no longer hold references
> >> >> > 2010-07-28 19:18:31,061 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: DELETING region hdfs://ubuntu-namenode:54310/hbase/filestore/597566178
> >> >> > ...
> >> >> >
> >> >> > Which may correspond to the time when it was recovering (if so, I just missed it coming back online).
> >> >> >
> >> >> > As a final note, I re-ran some of the clients today, and it appears some are OK, and some consistently give:
> >> >> >
> >> >> > Error: io exception when loading file: /tmp/archive_transfer/AVT100727-0/803de9924dc8f2d6.bi
> >> >> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=false, tries=9, numtries=10, i=3, listsize=7, region=filestore,d00ca2d087bdeeb4ee57225a41e19de9dd07e4d9b03be99298046644f9c9e354,1279599904220 for region filestore,bdfa9f2173033330cfae81ece08f75f0002bf3f3a54cde6bbf9192f0187e275b,1279604506836, row 'be29a0028bab2149a6c4f990e99c4e7c1c5be0656594738bbe87e7bf0fcda57f', but failed after 10 attempts
> >> >> >
> >> >> > So while the above is the error that brought the offline table to my attention - it may just be a separate bug?
> >> >> >
> >> >> > Not sure what causes it, but since it happens consistently in a program being run with one set of arguments, but not another, I'm thinking it's an error on my part.
> >> >> >
> >> >> > Any ideas on what could cause the table to go offline? Any common mistakes that lead to RetriesExhausted errors?
> >> >> >
> >> >> > The Retry errors occurred in a shared method that uploads a file to the filestore, so I'm not sure what causes it to fail in one case but not another. Maybe just the size of the file? (@300K).
> >> >> >
> >> >> > Thanks!
> >> >> >
> >> >> > Take care,
> >> >> > -stu
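(A general note on the retry messages above: the numtries=10 in those errors lines up with the client-side hbase.client.retries.number setting, which defaults to 10, and the wait between attempts comes from hbase.client.pause. Below is a minimal sketch of raising both on the client while a region is briefly unavailable, e.g. mid-split; the specific values are only examples, not a recommendation from the thread.)

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class RetryTunedClient {
        // Give a briefly-unavailable region more time before the client
        // gives up with RetriesExhaustedException.
        static HTable openFilestore() throws IOException {
            HBaseConfiguration conf = new HBaseConfiguration();
            conf.setInt("hbase.client.retries.number", 20);  // default is 10
            conf.setLong("hbase.client.pause", 2000);        // ms between attempts
            return new HTable(conf, "filestore");
        }
    }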
