To follow up on the retry error (still have no idea about the table going offline):
It was a coding error, sorta kinda. I was doing large batches with AutoFlush disabled and flushing at the end, figuring I could gain performance and just reprocess bad batches. Bad call. It appears I was consistently getting errors on flush, so the batch just kept failing. Now I flush after every successful file upload, and only one or two out of a couple thousand fail, and not consistently on one file, so retries are possible.

I also added a 3 second sleep when I get some kind of IOException executing a Put on this particular table, to prevent some sort of cascade effect. That part is going pretty smoothly now. (A rough sketch of the per-file loop is at the bottom, below the quoted message.)

Still don't know about the offline table thing - crossing my fingers and watching closely for now (and adding nodes).

I guess the moral of the first lesson is to really treat individual Puts as somewhat unreliable?

Take care,
-stu

--- On Thu, 7/29/10, Stuart Smith <[email protected]> wrote:

> From: Stuart Smith <[email protected]>
> Subject: Table goes offline - temporary outage + Retries Exhausted (related?)
> To: [email protected]
> Date: Thursday, July 29, 2010, 2:09 PM
> Hello,
> I have two problems that may or may not
> be related.
>
> One is trying to figure out a self-correcting outage I had
> last evening.
>
> I noticed issues starting with clients reporting:
>
> RetriesExhaustedException: Trying to contact region server
> Some server...
>
> I didn't see much going on in the regionserver logs, except
> for some major compactions. Eventually I decided to check
> the status of the table being written to, and it was
> disabled - and not by me (AFAIK).
>
> I tried enabling the table via the hbase shell.. and it was
> taking a long time, so I left for the evening. I came
> back this morning, and the shell had reported:
>
> hbase(main):002:0> enable 'filestore'
> NativeException: java.io.IOException: Unable to enable
> table filestore
>
> Except by now, the table was back up!
>
> After going through the logs a little more closely, the
> only thing I can find that seems correlated (at least by the
> timing):
>
> (in the namenode logs)
>
> 2010-07-28 18:39:17,213 INFO
> org.apache.hadoop.hbase.master.ServerManager: Processing
> MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS:
> filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1279721711873:
> Daughters;
> filestore,40d0be6fb72999fc5a69a9726544b004498127a788d63a69ba83eb2552a9d5ec,1280367555232,
> filestore,40d8647ad2222e18901071d36124fa3f310970776028ecec7a94d57df10dba86,1280367555232
> from ubuntu-hadoop-3,60020,1280263369525; 1 of 1
>
> ...
>
> 2010-07-28 18:42:45,835 DEBUG
> org.apache.hadoop.hbase.master.BaseScanner:
> filestore,fa8a881cb23eb9e41b197d305275440453e9967e4e6cf53024d478ab984f3392,1280347781550/1176636191
> no longer has references to
> filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171
> 2010-07-28 18:42:45,842 INFO
> org.apache.hadoop.hbase.master.BaseScanner: Deleting region
> filestore,fa7bf9992b94e60cb9d44437bd96d749b6e603285d92608ee0e9d5dedc858296,1279592800171
> (encoded=1245524105) because daughter splits no longer hold
> references
> ...
> 2010-07-28 18:59:39,000 DEBUG
> org.apache.hadoop.hbase.master.ChangeTableState: Processing
> unserved regions
> 2010-07-28 18:59:39,001 DEBUG
> org.apache.hadoop.hbase.master.ChangeTableState: Skipping
> region REGION => {NAME =>
> 'filestore,201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e,1279613059169',
> STARTKEY =>
> '201b5a6ff4aac0b345b6f9cc66998d32f4fc06d28156ace352f5effca1996e7e',
> ENDKEY =>
> '202ad98d24575f6782c9e9836834b77e0f5ddac0b1efa3cd21ac590482edf3e1',
> ENCODED => 1808201339, OFFLINE => true, SPLIT =>
> true, TABLE => {{NAME => 'filestore', FAMILIES =>
> [{NAME => 'content', COMPRESSION => 'LZO', VERSIONS
> => '3', TTL => '2147483647', BLOCKSIZE => '65536',
> IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} because
> it is offline and split
> ...
> 2010-07-28 18:59:39,001 DEBUG
> org.apache.hadoop.hbase.master.ChangeTableState: Processing
> regions currently being served
> 2010-07-28 18:59:39,002 DEBUG
> org.apache.hadoop.hbase.master.ChangeTableState: Already
> online
>
> ...
> 2010-07-28 19:00:34,485 INFO
> org.apache.hadoop.hbase.master.ServerManager: 4 region
> servers, 0 dead, average load 1060.0
> 2010-07-28 19:00:49,850 INFO
> org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scanning meta region {server:
> 192.168.193.67:60020, regionname: -ROOT-,,0, startKey:
> <>}
> 2010-07-28 19:00:49,858 INFO
> org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scan of 1 row(s) of meta region
> {server: 192.168.193.67:60020, regionname: -ROOT-,,0,
> startKey: <>} complete
> 2010-07-28 19:01:06,981 DEBUG
> org.apache.hadoop.hbase.master.BaseScanner:
> filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1280348069422/1455931173
> no longer has references to
> filestore,7498e0b3948939c37f9b75ceb5f5f2bec8ad3a41941032439741453d639e7752,1279713176956
> ...
> I'm not really sure, but I saw these messages toward the
> end:
> ...
> 2010-07-28 19:18:31,029 DEBUG
> org.apache.hadoop.hbase.master.BaseScanner:
> filestore,6541cf3f415214d56b0b385d11516b97e18fa2b25da141b770a8a9e0bfe60b52,1280359412067/1522934061
> no longer has references to
> filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326
> 2010-07-28 19:18:31,061 INFO
> org.apache.hadoop.hbase.master.BaseScanner: Deleting region
> filestore,6531dadde150a8fb89907296753bdfaabf38238b9064118ebf0aa50a4917f8ba,1279700538326
> (encoded=597566178) because daughter splits no longer hold
> references
> 2010-07-28 19:18:31,061 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: DELETING
> region
> hdfs://ubuntu-namenode:54310/hbase/filestore/597566178
> ...
> Which may correspond to the time when it was recovering (if
> so, I just missed it coming back online).
> ...
>
> As a final note, I re-ran some of the clients today, and it
> appears some are OK, and some consistently give:
>
> Error: io exception when loading file:
> /tmp/archive_transfer/AVT100727-0/803de9924dc8f2d6.bi
> org.apache.hadoop.hbase.client.RetriesExhaustedException:
> Trying to contact region server Some server,
> retryOnlyOne=true, index=0, islastrow=false, tries=9,
> numtries=10, i=3, listsize=7,
> region=filestore,d00ca2d087bdeeb4ee57225a41e19de9dd07e4d9b03be99298046644f9c9e354,1279599904220
> for region
> filestore,bdfa9f2173033330cfae81ece08f75f0002bf3f3a54cde6bbf9192f0187e275b,1279604506836,
> row
> 'be29a0028bab2149a6c4f990e99c4e7c1c5be0656594738bbe87e7bf0fcda57f',
> but failed after 10 attempts
>
> So while the above is the error that brought the offline
> table to my attention - it may just be a separate bug?
>
> Not sure what causes it, but since it happens consistently
> in a program being run with one set of arguments, but not
> another, I'm thinking it's an error on my part.
>
> Any ideas on what could cause the table to go offline?
> Any common mistakes that lead to RetriesExhausted errors?
>
> The Retry errors occurred in a shared method that uploads a
> file to the filestore, so I'm not sure what causes it to
> fail in one case, but not another. Maybe just the size of
> the file? (@300K).
>
> Thanks!
>
> Take care,
> -stu
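
P.S. In case anyone hits the same thing, this is roughly the shape of the per-file upload loop now. It's just a sketch, not my actual code: the class/method names, the "data" qualifier, the row-key handling, and the retry count are made up for illustration, and it assumes the HTable still has setAutoFlush(false) set wherever it gets created. The 'filestore' table and 'content' family are the real ones.

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FilestoreUploader {

        private static final byte[] CONTENT_FAMILY = Bytes.toBytes("content");
        private static final int MAX_ATTEMPTS = 3;     // illustrative
        private static final long BACKOFF_MS = 3000L;  // the 3 second sleep mentioned above

        // Stores one file as a single Put and flushes immediately, so a bad
        // flush only costs this file instead of a whole buffered batch.
        public static void storeFile(HTable table, byte[] rowKey, byte[] fileBytes)
                throws IOException, InterruptedException {
            for (int attempt = 1; ; attempt++) {
                try {
                    Put put = new Put(rowKey);
                    put.add(CONTENT_FAMILY, Bytes.toBytes("data"), fileBytes);
                    table.put(put);
                    table.flushCommits();  // flush per file, not at the end of the batch
                    return;
                } catch (IOException e) {  // RetriesExhaustedException is an IOException
                    if (attempt >= MAX_ATTEMPTS) {
                        throw e;           // give up; the caller can reprocess just this file
                    }
                    Thread.sleep(BACKOFF_MS);  // back off so one bad Put/flush doesn't cascade
                }
            }
        }
    }

If the flush blows up on one file, only that file gets retried or reprocessed, and the sleep gives the region a moment to settle before the next attempt.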
