Thanks Ted.
St.Ack

On Fri, Jan 21, 2011 at 11:05 AM, Ted Yu <[email protected]> wrote:
> It seems there is a typo:
> "we'll no interrupt the running compaction" should be "we'll now interrupt
> the running compaction"
>
> On Fri, Jan 21, 2011 at 10:47 AM, Stack <[email protected]> wrote:
>
>> On Fri, Jan 21, 2011 at 4:51 AM, Wayne <[email protected]> wrote:
>> > After several hours I have figured out how to get the Disable command to
>> > work and how to delete manually, but in the process there are 4 problems I
>> > encountered that I think are areas that could be improved (or my
>> > understanding improved).
>> >
>> > 1) The client timeout is used for the disable command which was my problem.
>> > Does this totally make sense? Should a DML minded timeout be used for DDL
>> > statements that we know can take a very long time normally with a large
>> > cluster?
>> >
>>
>> Sorry Wayne. I meant to respond yesterday to your original query.
>>
>> Enable/Disable has been redone in 0.90. Now there are added
>> enabling/disabling states that are maintained up in zk and in shell
>> there are commands is_enabled and is_disabled. We still have the same
>> (DML) timeout (sort of -- see below for more) but at least now if it
>> times out, you are not hosed. The disable or enable process is still
>> running and you can query its state. There is also notion of async
>> enable/disable though this latter facility is not exposed in shell,
>> only in the HBaseAdmin API.
>>
>>
>> > 2) If the disable command fails the first time it does not "roll back". The
>> > ONLY way to proceed is to enable and then try to disable again. The first
>> > disable attempt is all that seems to work. Subsequent disable statements
>> > usually work without errors but never seem to "work". The entire table
>> > should be disabled after issuing this command or the entire table should
>> > still be enabled. I was caught in this half disabled or mostly disabled
>> > state, which was very frustrating.
>> >
>>
>> Sorry about that. Should be better in 0.90.0.
>>
>> Things should run a bit faster in 0.90.0 too because disable used to
>> include an update of .META. per region plus a close of all regions
>> that make up the table. In 0.90.0 there is no longer the .META.
>> update and close is more prompt now; in the past close would wait on
>> any running compactions to complete before proceeding. In 0.90.0
>> we'll no interrupt the running compaction so close happens the sooner.
>>
>> There is room for a bunch more improvement. For example, deleting a
>> table, there should be short-circuit that punts on flush of in-memory
>> state and clean-close of open regions.
>>
>> > 3) The biggest issue of all is why certain regions do not report back to the
>> > disable command. What are the various states of a region that could cause
>> > this? Compaction I know is one, what else could cause the disable command to
>> > take too long? Shouldn't a disable force itself through and wait long enough
>> > to be able to disable every region? Again a long wait time or a more
>> > forceful operation would help.
>> >
>>
>> It wasn't that smart in 0.20/0.89. It's still pretty dumb but better in
>> 0.90.0.
>>
>> Master process runs the enable/disable process in both old and new
>> HBase. In 0.20/0.89, it was a sync process w/ master waiting on
>> regions to flip to 'offline' after successful close. The state of
>> disabledness was when all regions in table had 'offline' state. Any
>> hiccup, a problem closing or a failure to update .META. w/ offline per
>> region would bork the disabling process. It was super fragile. We
>> tried to talk it up as so.
>>
>> In 0.90, client queues in master an executor that flips table to
>> disabling in zk and then in parallel sends out unassigns of all table
>> regions. The executor then hangs around with a more DDL-like timeout
>> of hbase.bulk.assignment.waiton.empty.rit (10 minutes by default).
>> Meantime clients can check state of the disable. After all unassigns
>> complete, the table is flipped to disabled.
>>
>>
>> > 4) Through all of the attempts to disable I saw regions coming and going and
>> > nothing was consistent. The UI showed the table as disabled and listed 1
>> > region in the table (there were 1000s). The node view listed several other
>> > regions but not the same one as the table view. It was a very strange
>> > situation. The UI to browse the tables and regions is great but it would be
>> > even better if it gave a 100% view of regions and their current states. A
>> > summary view of region counts per table based on state or status would be
>> > fantastic.
>>
>> Please file a JIRA. Sounds like good idea. We could hoist stuff up
>> out of hbck tool up into UI.
>>
>>
>> > There is a compaction count, but what about in split, read/write
>> > lock, disabled, etc. What is the precise list of region states that could
>> > occur and show a summary count per state as well as detailed state for each
>> > specific region in the list. Fundamentally this is the health monitor of the
>> > system and as a dba I really need to know the 100% count of regions and
>> > where they are all at in terms of availability. Are they disabled, blocked
>> > for writes, blocked for reads, in compaction, etc. etc. If there are various
>> > states that cause disabling to be blocked it can be reported here so that I
>> > at least know when a disable command can be executed successfully (and this
>> > should be documented).
>> >
>>
>>
>> Please file a JIRA. This is great stuff.
>>
>> Sorry for pain caused messing w/ broke enable/disable. It should be
>> better in 0.90 and easier to fix if bugs.
>>
>> St.Ack
>>
>>
>> > Thanks
>> >
>> > On Thu, Jan 20, 2011 at 9:01 PM, Wayne <[email protected]> wrote:
>> >
>> >> I need to delete some tables and I am not sure the best way to do it. The
>> >> shell does not work.
>> >> The disable command says it runs ok but every time I
>> >> run drop or truncate I get an exception that says the table is not
>> >> disabled. The UI shows it as disabled but truncate/drop still do not work.
>> >> I have even tried to restart the cluster as sometimes that makes the disable
>> >> "stick".
>> >>
>> >> What is the best way to delete a table manually? My assumption is that with
>> >> 10k regions in 3 tables that I need to delete that the shell is not going to
>> >> work. How can I do this without a completely fresh install of everything?
>> >> How can the data/tables be removed manually without too much pain?
>> >>
>> >> Thanks.
>> >>
>> >
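
For anyone scripting around the 0.90 behaviour described in this thread (the disable keeps running after the client call times out, and its state can be queried), the waiting side might be sketched roughly as below. This is only an illustration in Python, not HBase code: wait_until_disabled and all of its parameters are hypothetical names, and is_disabled stands for whatever check you wire in yourself (for example a wrapper around the shell's is_disabled command). The 600 second default simply mirrors the 10 minute figure quoted for hbase.bulk.assignment.waiton.empty.rit.

```python
import time
from typing import Callable


def wait_until_disabled(
    is_disabled: Callable[[], bool],
    timeout_s: float = 600.0,   # assumption: mirrors the 10 minute default cited in the thread
    poll_s: float = 5.0,        # assumption: arbitrary polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_disabled() until it returns True or timeout_s elapses.

    Returns True if the table was seen disabled, False on timeout.
    is_disabled is a caller-supplied check; nothing here talks to HBase.
    """
    deadline = clock() + timeout_s
    while True:
        if is_disabled():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_s, remaining))
```

A script would only go on to drop or truncate when this returns True. On a False result the table may still be in the disabling state, so checking the state again later is safer than issuing a second disable.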
