Sorry J-D, I tried one more thing. I grepped for "compaction" in the .META. RS log. I saw a lot of them, and each seems to take quite a long time too (one took 23 secs). Would this be something worth looking into?
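As an aside, the durations can be pulled out of lines like the ones below with a quick script. This is only a sketch: it assumes nothing beyond the "compaction completed ... in Nsec" format visible in the excerpt, and the embedded sample is a subset of it.

```python
# Quick sketch: pull the "in Nsec" durations out of the RS log excerpt
# below and summarize them. Assumes only the "compaction completed ...
# in <N>sec" format shown in the excerpt; not a general log parser.
import re

log = """\
11/03/01 16:46:27 INFO regionserver.HRegion: compaction completed on region .META.,,1 in 9sec
11/03/01 16:47:18 INFO regionserver.HRegion: compaction completed on region .META.,,1 in 10sec
11/03/01 16:47:43 INFO regionserver.HRegion: compaction completed on region .META.,,1 in 23sec
"""

secs = [int(m) for m in re.findall(r"compaction completed on region \.META\.,,1 in (\d+)sec", log)]
print("count=%d max=%ds total=%ds" % (len(secs), max(secs), sum(secs)))
```

Run against the full log, a high count with double-digit maxima would back up the "compactions are hogging the .META. RS" theory.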
11/03/01 16:44:30 INFO regionserver.HRegion: compaction completed on region .META.,,1 in 9sec
11/03/01 16:46:18 INFO regionserver.HRegion: Starting compaction on region .META.,,1
11/03/01 16:46:18 INFO regionserver.Store: Started compaction of 3 file(s) in info of .META.,,1 into /hbase/.META./compaction.dir/1028785192, seqid=1298982189854
11/03/01 16:46:27 INFO regionserver.Store: Completed compaction of 3 file(s) in info of .META.,,1; new storefile is hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/5138927692276532859; store size is 11.1m
11/03/01 16:46:27 INFO regionserver.HRegion: compaction completed on region .META.,,1 in 9sec
11/03/01 16:47:08 INFO regionserver.HRegion: Starting compaction on region .META.,,1
11/03/01 16:47:08 INFO regionserver.Store: Started compaction of 3 file(s) in info of .META.,,1 into /hbase/.META./compaction.dir/1028785192, seqid=1298982189946
11/03/01 16:47:18 INFO regionserver.Store: Completed compaction of 3 file(s) in info of .META.,,1; new storefile is hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/5873146955360312580; store size is 11.1m
11/03/01 16:47:18 INFO regionserver.HRegion: compaction completed on region .META.,,1 in 10sec
11/03/01 16:47:18 INFO regionserver.HRegion: Starting compaction on region .META.,,1
11/03/01 16:47:18 INFO regionserver.HRegion: compaction completed on region .META.,,1 in 0sec
11/03/01 16:47:19 INFO regionserver.HRegion: Starting compaction on region .META.,,1
11/03/01 16:47:20 INFO regionserver.Store: Started compaction of 3 file(s) in info of .META.,,1 into /hbase/.META./compaction.dir/1028785192, seqid=1298982190038
11/03/01 16:47:43 INFO regionserver.Store: Completed compaction of 3 file(s) in info of .META.,,1; new storefile is hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/2025458053952320907; store size is 11.2m
11/03/01 16:47:43 INFO regionserver.HRegion: compaction completed on region .META.,,1 in 23sec
11/03/01 16:47:43 INFO regionserver.HRegion: Starting compaction on region .META.,,1
11/03/01 16:47:43 INFO regionserver.Store: Started compaction of 4 file(s) in info of .META.,,1 into /hbase/.META./compaction.dir/1028785192, seqid=1298982190222
11/03/01 16:47:46 INFO regionserver.Store: Completed compaction of 4 file(s) in info of .META.,,1; new storefile is hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/7645102490458560555; store size is 11.2m
11/03/01 16:47:46 INFO regionserver.HRegion: compaction completed on region .META.,,1 in 3sec
11/03/01 16:47:47 INFO regionserver.HRegion: Starting compaction on region .META.,,1
11/03/01 16:47:47 INFO regionserver.Store: Started compaction of 3 file(s) in info of .META.,,1 into /hbase/.META./compaction.dir/1028785192, seqid=1298982190267
11/03/01 16:47:57 INFO regionserver.Store: Completed compaction of 3 file(s) in info of .META.,,1; new storefile is hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/2555741109593674333; store size is 11.2m
11/03/01 16:47:58 INFO regionserver.HRegion: compaction completed on region .META.,,1 in 10sec
11/03/01 16:47:58 INFO regionserver.HRegion: Starting compaction on region .META.,,1
11/03/01 16:47:58 INFO regionserver.Store: Started compaction of 3 file(s) in info of .META.,,1 into /hbase/.META./compaction.dir/1028785192, seqid=1298982190359
11/03/01 16:48:08 INFO regionserver.Store: Completed compaction of 3 file(s) in info of .META.,,1; new storefile is hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/2032875459950620676; store size is 11.2m
11/03/01 16:48:09 INFO regionserver.HRegion: compaction completed on region .META.,,1 in 10sec
11/03/01 16:48:09 INFO regionserver.HRegion: Starting compaction on region .META.,,1
11/03/01 16:48:09 INFO regionserver.Store: Started compaction of 3 file(s) in info of .META.,,1 into /hbase/.META./compaction.dir/1028785192, seqid=1298982190451
11/03/01 16:48:22 INFO regionserver.Store: Completed compaction of 3 file(s) in info of .META.,,1; new storefile is hdfs://sa-hbase-server-9002.sea19.amazon.com:50001/hbase/.META./1028785192/info/1870868686430006280; store size is 11.3m
11/03/01 16:48:22 INFO regionserver.HRegion: compaction completed on region .META.,,1 in 13sec
11/03/01 16:48:22 INFO regionserver.HRegion: Starting compaction on region .META.,,1

On Tue, Mar 1, 2011 at 11:01 AM, Jean-Daniel Cryans <[email protected]> wrote:
> That is correct. Also, disabling a table requires scanning and modifying the .META. region, so if it happens to be slow then that operation will be slow too.
>
> J-D
>
> On Tue, Mar 1, 2011 at 10:55 AM, Nanheng Wu <[email protected]> wrote:
>> I see, so a Get is only affected for regions not in the client's cache. I think that explains why that error seems hard to reproduce. This is how I understood a Get works: if the client doesn't have the region in its cache it reads the .META. table; if it does, it makes an RPC to the RS directly. Also, each RS knows which regions it actually hosts, so if the region is not there it means the client's cache is stale and the client has to read .META. again. Is this correct?
>>
>> On Tue, Mar 1, 2011 at 10:48 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>> Depends if you have to query .META. first to get the location of the region that hosts the row.
>>>
>>> J-D
>>>
>>> On Tue, Mar 1, 2011 at 10:45 AM, Nanheng Wu <[email protected]> wrote:
>>>> Man, I appreciate so much all the help you've provided so far. I guess I'll keep digging. Would this meta scan cause Gets or Scans on user tables to be very, very slow too? That was actually the problem I first noticed: from time to time a client Get or Scan would be extremely slow on my tables.
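An aside on the Get path described above: the cache-then-.META.-then-retry loop can be sketched as a toy model. This is purely illustrative (the dictionaries and names are made up, not HBase client code), but it shows why every cache miss or stale entry turns into a read against .META., so a slow .META. region makes exactly those Gets slow.

```python
# Toy model of the client-side region lookup described in the thread:
# check the local cache first, go to .META. on a miss, and retry via
# .META. when a cached location turns out to be stale. Illustrative
# only; not the real HBase client code.

META = {"row1": "rs-A", "row2": "rs-B"}         # .META.: row -> region server
SERVERS = {"rs-A": {"row1"}, "rs-B": {"row2"}}  # what each RS actually hosts

meta_reads = 0

def read_meta(row):
    global meta_reads
    meta_reads += 1          # every miss or stale hit costs a .META. read,
    return META[row]         # which is why a slow .META. slows down Gets

class Client:
    def __init__(self):
        self.cache = {}      # row -> cached region server location

    def get(self, row):
        loc = self.cache.get(row)
        if loc is None:                  # cache miss: ask .META.
            loc = read_meta(row)
            self.cache[row] = loc
        if row not in SERVERS[loc]:      # RS says "not mine": cache is stale
            del self.cache[row]
            loc = read_meta(row)         # re-read .META.
            self.cache[row] = loc
        return loc

c = Client()
c.get("row1")                      # first Get: one .META. read
c.get("row1")                      # cached: no .META. read
SERVERS["rs-A"].discard("row1")    # region moves to another server...
META["row1"] = "rs-B"
SERVERS["rs-B"].add("row1")
c.get("row1")                      # stale cache: one more .META. read
print(meta_reads)                  # -> 2
```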
>>>>
>>>> On Tue, Mar 1, 2011 at 10:38 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>> The store file looks fairly new; maybe it was just major compacted? You could easily figure that out from the region server's log.
>>>>>
>>>>> Your problem is definitely .META. region slowness. If scanning 8k rows takes 30 minutes, scanning and updating 260 rows at the same time probably isn't much faster.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Tue, Mar 1, 2011 at 10:34 AM, Nanheng Wu <[email protected]> wrote:
>>>>>> J-D,
>>>>>>
>>>>>> It looks like I don't have that many files:
>>>>>>
>>>>>> drwxr-xr-x   - root supergroup        0 2011-02-28 23:21 /hbase/.META./1028785192
>>>>>> -rw-r--r--   3 root supergroup      932 2011-02-03 19:23 /hbase/.META./1028785192/.regioninfo
>>>>>> drwxr-xr-x   - root supergroup        0 2011-03-01 17:13 /hbase/.META./1028785192/info
>>>>>> -rw-r--r--   3 root supergroup 11831330 2011-03-01 17:13 /hbase/.META./1028785192/info/6687606720393313750
>>>>>> drwxr-xr-x   - root supergroup        0 2011-03-01 17:13 /hbase/.META./compaction.dir
>>>>>>
>>>>>> So you think compaction of the .META. table on the RS could be the problem? While the metaScanner is running I saw virtually no logs on that server, though; I did see compactions once the meta scan ended and region closes started to happen. I wonder what the cluster is doing while the meta scan is going.
>>>>>>
>>>>>> I would definitely upgrade in the near future; it's just that I'm afraid upgrading won't guarantee a fix if I don't understand the nature of the problem more clearly. We also have another test cluster, running the same versions of everything on cheaper hardware, and it doesn't have this problem.
>>>>>>
>>>>>> On Tue, Mar 1, 2011 at 9:52 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>> That long .META. scan might be the culprit, good catch!
>>>>>>>
>>>>>>> If you run this from the hadoop folder:
>>>>>>>
>>>>>>> bin/hadoop dfs -lsr /hbase/.META.
>>>>>>>
>>>>>>> Do you see a ton of files? Let's define "ton" as more than 16 files. If so, and I see you have a very high number of regions, then you could be hitting this problem: https://issues.apache.org/jira/browse/HBASE-3499
>>>>>>>
>>>>>>> In short, every few .META. changes result in a flush, and it could be that the region server hosting it isn't able to keep up with the compactions. It's done that way because append isn't supported in your hadoop, so that you won't lose too much data if the node dies, which is really bad when it happens to .META.
>>>>>>>
>>>>>>> In your case it might not be so bad to set your .META.'s MEMSTORE_SIZE back to the default value, since you're not writing and hence not splitting. Running the script attached in that jira would do it (don't forget to restart HBase after running it).
>>>>>>>
>>>>>>> Lastly, upgrading to HBase 0.90.1 and a hadoop that supports append should be a priority.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Tue, Mar 1, 2011 at 9:30 AM, Nanheng Wu <[email protected]> wrote:
>>>>>>>> Hi J-D:
>>>>>>>>
>>>>>>>> I did the scan like you suggested but no splits came up. This kind of makes sense to me, since we write HFiles using a MR job and then pretty much use HBase read-only, so maybe that's why the regions never got split? I think my slow disable problem is related to this RegionManager.metaScanner.
>>>>>>>> I grepped for that in the Master's log and I saw:
>>>>>>>>
>>>>>>>> 11/03/01 15:30:37 INFO master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.146.14.38:60020, regionname: .META.,,1, startKey: <>}
>>>>>>>> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner scan of 8296 row(s) of meta region {server: 10.146.14.38:60020, regionname: .META.,,1, startKey: <>} complete
>>>>>>>> 11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.146.14.38:60020, regionname: .META.,,1, startKey: <>}
>>>>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>>>> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner scan of 8296 row(s) of meta region {server: 10.146.14.38:60020, regionname: .META.,,1, startKey: <>} complete
>>>>>>>> 11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.146.14.38:60020, regionname: .META.,,1, startKey: <>}
>>>>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>>>> 11/03/01 16:46:07 INFO master.BaseScanner: RegionManager.metaScanner scan of 8298 row(s) of meta region {server: 10.146.14.38:60020, regionname: .META.,,1, startKey: <>} complete
>>>>>>>>
>>>>>>>> It looks like the meta scanner is always running, each pass taking 20-30 minutes. Here's what I did:
>>>>>>>>
>>>>>>>> 1) disable table -> at 16:28:31; according to the log above there's a meta scan in progress.
>>>>>>>>
>>>>>>>> 2) the command hangs, with no output in the .META. RS's log. At 11/03/01 16:46:57 the client got:
>>>>>>>> 11/03/01 16:46:58 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got 10.146.14.38:60020
>>>>>>>> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Found ROOT at 10.146.14.38:60020
>>>>>>>> 11/03/01 16:46:58 DEBUG client.HConnectionManager$TableServers: Cached location for .META.,,1 is 10.146.14.38:60020
>>>>>>>> This corresponds exactly to when the meta scan completed. The .META. RS started to spew some logs about closing regions.
>>>>>>>>
>>>>>>>> 3) Another meta scan started and the client hung again until it finished, and the disable command returned. The whole thing took 2688.5930 seconds.
>>>>>>>>
>>>>>>>> What kind of problem does it look like to you, J-D? What does this background task metaScanner do?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> On Mon, Feb 28, 2011 at 5:17 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>>> I didn't get most of the information I was looking for... but I do have something to work with. So I asked whether the metaScanner was making progress or not, and from that log I think I see it does:
>>>>>>>>>
>>>>>>>>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers: Rowscanned=261, rowsOffline=3
>>>>>>>>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all regions to be disabled from myTable
>>>>>>>>>
>>>>>>>>> Only 3 rows are offline, whereas we expect all the regions to be offline. I expect you see those lines a lot? If so, one thing I'd like to see is the result of this command:
>>>>>>>>>
>>>>>>>>> scan '.META.', {STARTROW => "myTable,,", LIMIT => 261}
>>>>>>>>>
>>>>>>>>> It's going to be big. Then grep in the result for the string SPLIT, and please post back here the lines that match.
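An aside: the RegionManager.metaScanner passes quoted above can be timed by pairing the "scanning meta region" and "scan ... complete" lines from the master log. A minimal sketch (the embedded lines are shortened from the excerpt in this thread, and the timestamp format is assumed from it):

```python
# Rough sketch: pair up "scanning meta region" / "... complete" lines from
# the master log and print how long each RegionManager.metaScanner pass
# took. Log line layout is assumed from the excerpt posted in this thread;
# not a general-purpose parser.
from datetime import datetime

log = """\
11/03/01 15:30:37 INFO master.BaseScanner: RegionManager.metaScanner scanning meta region
11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner scan of 8296 row(s) of meta region complete
11/03/01 15:55:40 INFO master.BaseScanner: RegionManager.metaScanner scanning meta region
11/03/01 16:21:04 INFO master.BaseScanner: RegionManager.metaScanner scan of 8296 row(s) of meta region complete
"""

start = None
durations = []  # minutes per metaScanner pass
for line in log.splitlines():
    ts = datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S")
    if "scanning meta region" in line:
        start = ts
    elif "complete" in line and start is not None:
        durations.append((ts - start).total_seconds() / 60)
        start = None

print(["%.0f min" % d for d in durations])
```

On the lines quoted earlier this gives roughly 25-minute passes, matching the "20-30 minutes" observation.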
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Mon, Feb 28, 2011 at 5:04 PM, Nanheng Wu <[email protected]> wrote:
>>>>>>>>>> I issued two more disable commands and looked at the .META. RS. The last log entry looks like:
>>>>>>>>>>
>>>>>>>>>> 11/03/01 00:29:14 INFO regionserver.HLog: Roll /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog.dat.1298935752660, entries=13527, calcsize=3839046, filesize=2560143. New hlog /hbase/.logs/<RS_HOSTNAME>,60020,1298935751670/hlog.dat.1298939354072
>>>>>>>>>>
>>>>>>>>>> I issued the command at 00:34:59; the client debug output had nothing but "11/03/01 00:34:59 DEBUG zookeeper.ClientCnxn: Got ping response for sessionid:0x12e6e938e9c000b after 90ms" until 00:47:19. Then all of a sudden it output the following:
>>>>>>>>>>
>>>>>>>>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location server 10.146.14.38:60020, location region name .META.,,1
>>>>>>>>>> 11/03/01 00:47:28 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got 10.146.14.38:60020
>>>>>>>>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Found ROOT at 10.146.14.38:60020
>>>>>>>>>> 11/03/01 00:47:28 DEBUG client.HConnectionManager$TableServers: Cached location for .META.,,1 is 10.146.14.38:60020
>>>>>>>>>> 11/03/01 00:47:30 DEBUG client.HConnectionManager$TableServers: Rowscanned=261, rowsOffline=3
>>>>>>>>>> 11/03/01 00:47:30 DEBUG client.HBaseAdmin: Sleep. Waiting for all regions to be disabled from myTable
>>>>>>>>>> 11/03/01 00:47:31 DEBUG client.HBaseAdmin: Wake. Waiting for all regions to be disabled from myTable
>>>>>>>>>>
>>>>>>>>>> Here's the full thread dump of the .META. RS from 00:34:59 to 00:47:31: http://pastebin.com/Ywwx8afQ
>>>>>>>>>>
>>>>>>>>>> I hope I compiled some useful information here. I am pretty lost at this point; I would really appreciate your help!
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 28, 2011 at 2:58 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>>>>> Ok, so no debug from the disable command? BTW, you should output your GC somewhere else, like the .out file or a completely separate file.
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 28, 2011 at 2:54 PM, Nanheng Wu <[email protected]> wrote:
>>>>>>>>>>>> I see, so I should jstack the RS hosting the .META. region. I'll do that.
>>>>>>>>>>>>
>>>>>>>>>>>> The master log pretty much looks like this: should I grep for something specific?
>>>>>>>>>>>>
>>>>>>>>>>>> 11/02/28 22:52:56 INFO master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region complete
>>>>>>>>>>>> 2011-02-28T22:52:57.138+0000: 263241.014: [GC 263241.014: [ParNew: 19136K->2112K(19136K), 0.0033350 secs] 378606K->361596K(641944K) icms_dc=0 , 0.0034330 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
>>>>>>>>>>>> 2011-02-28T22:53:04.956+0000: 263248.832: [GC 263248.832: [ParNew: 19068K->2112K(19136K), 0.0029690 secs] 378552K->361630K(641944K) icms_dc=0 , 0.0030460 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
>>>>>>>>>>>> 2011-02-28T22:53:12.664+0000: 263256.540: [GC 263256.541: [ParNew: 19136K->2112K(19136K), 0.0037690 secs] 378654K->361647K(641944K) icms_dc=0 , 0.0038660 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
>>>>>>>>>>>> 2011-02-28T22:53:20.608+0000: 263264.484: [GC 263264.484: [ParNew: 19136K->1810K(19136K), 0.0032600 secs] 378671K->361346K(641944K) icms_dc=0 , 0.0033470 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
>>>>>>>>>>>> 2011-02-28T22:53:28.278+0000: 263272.154: [GC 263272.154: [ParNew: 18834K->2112K(19136K), 0.0029870 secs] 378370K->361661K(641944K) icms_dc=0 , 0.0030670 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
>>>>>>>>>>>> 2011-02-28T22:53:35.919+0000: 263279.795: [GC 263279.795: [ParNew: 19136K->1747K(19136K), 0.0037090 secs] 378685K->361298K(641944K) icms_dc=0 , 0.0037920 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
>>>>>>>>>>>> 11/02/28 22:53:36 INFO master.ServerManager: 9 region servers, 0 dead, average load 760.2222222222222
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 28, 2011 at 2:49 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>>>>>>> Are you able to tell if that call in metaScanner is hanging, or if it's making multiple calls to the .META. region?
>>>>>>>>>>>>>
>>>>>>>>>>>>> If the former, then jstack the region server that hosts .META. and see where it's blocked.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If the latter, then it means your .META. region is slow? Again, what's going on on the RS that hosts .META.?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Finally, what's the master's log like during that time?
>>>>>>>>>>>>>
>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Feb 28, 2011 at 2:41 PM, Nanheng Wu <[email protected]> wrote:
>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks so much for your help so far! I sent disable commands on 4 rather small tables and they got stuck for a long time again, so I took a jstack of the master. From what I can tell, all disableTable calls are blocked by a meta scanner thread (sample dump below). At the moment there were no other requests to the server at all. How should I investigate this further? If it helps, here's some background: I have several datasets, each in a separate table. Our data pipeline produces a new version of each dataset every day, and only the latest version should be used. This is how the data is loaded: for each dataset, 1) run a MR job that outputs HFiles, 2) call loadTable.rb to create a new table, 3) disable and drop the previous version. As a result some calls to load table and drop table would overlap. Please let me know if something stands out to you as a potential culprit. Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> BTW, I am running Hadoop 0.20.2 with HBase 0.20.6.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thread 47 (IPC Server handler 13 on 60000):
>>>>>>>>>>>>>>   State: BLOCKED
>>>>>>>>>>>>>>   Blocked count: 3801
>>>>>>>>>>>>>>   Waited count: 72719
>>>>>>>>>>>>>>   Blocked on java.lang.Object@75ac522c
>>>>>>>>>>>>>>   Blocked by 27 (RegionManager.metaScanner)
>>>>>>>>>>>>>>   Stack:
>>>>>>>>>>>>>>     org.apache.hadoop.hbase.master.TableOperation.process(TableOperation.java:154)
>>>>>>>>>>>>>>     org.apache.hadoop.hbase.master.HMaster.disableTable(HMaster.java:842)
>>>>>>>>>>>>>>     sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>>>>>>>>>>>>>>     sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>>>>>     java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>>>>>     org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>>>>>>>>>>>>     org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thread 27 (RegionManager.metaScanner):
>>>>>>>>>>>>>>   State: WAITING
>>>>>>>>>>>>>>   Blocked count: 1526058
>>>>>>>>>>>>>>   Waited count: 1488998
>>>>>>>>>>>>>>   Waiting on org.apache.hadoop.hbase.ipc.HBaseClient$Call@4dd44ab0
>>>>>>>>>>>>>>   Stack:
>>>>>>>>>>>>>>     java.lang.Object.wait(Native Method)
>>>>>>>>>>>>>>     java.lang.Object.wait(Object.java:485)
>>>>>>>>>>>>>>     org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>>>>>>>>>>>>>>     org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>>>>>>>>>>>>     $Proxy1.get(Unknown Source)
>>>>>>>>>>>>>>     org.apache.hadoop.hbase.master.BaseScanner.checkAssigned(BaseScanner.java:543)
>>>>>>>>>>>>>>     org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:192)
>>>>>>>>>>>>>>     org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>>>>>>>>>>>>>     org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>>>>>>>>>>>>>     org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
>>>>>>>>>>>>>>     org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Feb 25, 2011 at 10:23 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>>>>>>>>> An hour to disable? That doesn't sound right at all :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would approach this problem like I generally do with HBase issues: first check the master log for any weirdness regarding my problem (in this case, grep for the table name).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then I would look at the region server log(s) of the nodes that were hosting regions from that table. You should see the steps taken to disable the regions (starting to close, flush, region completely closed).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If you are able to do it while it's taking a very long time to disable, try to jstack the process that seems to be hanging.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Finally, like I said in another thread, there's a bug in 0.20.6 that almost prevents disabling (or re-enabling) a table if any region recently split and the parent wasn't cleaned yet from .META., and that is fixed in 0.90.1.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 11:37 PM, Nanheng Wu <[email protected]> wrote:
>>>>>>>>>>>>>>>> I think you are right; maybe in the long run I need to re-architect my system so that it doesn't need to create new and delete old tables all the time. In the short term I am having a really hard time with the disabling function. I ran a disable command on a very small table (probably dozens of MBs in size) with no clients using the cluster at all, and it took about 1 hour to complete! The weird thing is that on the web UI only the region server carrying the .META. table has non-zero requests; all other RSs have 0 requests the entire time. I would think they should get some requests to flush the memstore at least. I *am* using the same RS nodes for some map reduce jobs at the time, and top shows the memory usage is almost full on the .META. region's server. Would you have some idea of what I should investigate? Thanks so much.
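Stepping back, the jstack earlier in the thread shows a classic pattern: the metaScanner chore holds the monitor that disableTable's TableOperation.process also needs, so the IPC handler sits BLOCKED until the slow meta scan finishes. A toy model of that interaction (pure illustration with made-up names, not HBase code):

```python
# Toy reproduction of the jstack pattern: a "metaScanner" chore holds a
# shared lock while doing slow work, and a "disableTable" handler cannot
# proceed until the chore releases it. Illustration only.
import threading
import time

meta_lock = threading.Lock()   # stands in for the monitor both threads contend on
events = []

def meta_scanner():
    with meta_lock:
        events.append("scan start")
        time.sleep(0.2)        # stands in for a 25-minute .META. scan
        events.append("scan done")

def disable_table():
    time.sleep(0.05)           # request arrives while the scan is running
    events.append("disable requested")
    with meta_lock:            # blocks here, like IPC handler 13 in the jstack
        events.append("disable ran")

t1 = threading.Thread(target=meta_scanner)
t2 = threading.Thread(target=disable_table)
t1.start(); t2.start()
t1.join(); t2.join()
print(events)                  # -> ['scan start', 'disable requested', 'scan done', 'disable ran']
```

Which matches the observed behavior in the thread: the disable command only returns once the in-flight metaScanner pass (plus the next one) completes.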
