On Fri, Jan 21, 2011 at 4:51 AM, Wayne <[email protected]> wrote:
> After several hours I have figured out how to get the Disable command to
> work and how to delete manually, but in the process I ran into 4 problems
> that I think point to areas that could be improved (or my understanding
> improved).
>
> 1) The client timeout is used for the disable command, which was my
> problem. Does this totally make sense? Should a DML-minded timeout be
> used for DDL statements that we know can normally take a very long time
> on a large cluster?
>

Sorry Wayne.  I meant to respond yesterday to your original query.

Enable/Disable has been redone in 0.90.  Now there are added
enabling/disabling states that are maintained up in zk, and in the shell
there are commands is_enabled and is_disabled.  We still have the same
(DML) timeout (sort of -- see below for more) but at least now if it
times out, you are not hosed.  The disable or enable process is still
running and you can query its state.  There is also a notion of async
enable/disable, though this latter facility is not exposed in the shell,
only in the HBaseAdmin API.
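
For example, something like the below rides over a client-side timeout
(a sketch only: the table name, poll interval, and bare-bones exception
handling are illustrative, not prescriptive):

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class DisableAndPoll {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);
      String table = "mytable";  // illustrative table name
      try {
        // Synchronous disable; throws if the client-side timeout fires
        // before all regions have closed.
        admin.disableTable(table);
      } catch (IOException e) {
        // The disable keeps running in the master, so poll its state
        // rather than giving up.
        while (!admin.isTableDisabled(table)) {
          Thread.sleep(1000);
        }
      }
    }
  }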


> 2) If the disable command fails the first time it does not "roll back". The
> ONLY way to proceed is to enable and then try to disable again. The first
> disable attempt is the only one that seems to do anything. Subsequent
> disable statements usually run without errors but never seem to "work".
> The entire table should be disabled after issuing this command or the
> entire table should still be enabled. I was caught in this half-disabled
> or mostly-disabled state, which was very frustrating.
>

Sorry about that.   Should be better in 0.90.0.

Things should run a bit faster in 0.90.0 too, because disable used to
include an update of .META. per region plus a close of all regions that
make up the table.  In 0.90.0 there is no longer the .META. update, and
close is more prompt now; in the past, close would wait on any running
compactions to complete before proceeding.  In 0.90.0 we now interrupt
the running compaction so the close happens sooner.

There is room for a bunch more improvement.  For example, when deleting
a table, there should be a short-circuit that punts on the flush of
in-memory state and the clean close of open regions.

> 3) The biggest issue of all is why certain regions do not report back to
> the disable command. What are the various states of a region that could
> cause this? Compaction I know is one; what else could cause the disable
> command to take too long? Shouldn't a disable force itself through and
> wait long enough to be able to disable every region? Again, a longer wait
> time or a more forceful operation would help.
>

It wasn't that smart in 0.20/0.89.  It's still pretty dumb, but better in 0.90.0.

The master process runs the enable/disable process in both old and new
HBase.  In 0.20/0.89, it was a synchronous process w/ the master waiting
on regions to flip to 'offline' after a successful close.  The table
only counted as disabled once all of its regions were in the 'offline'
state.  Any hiccup, whether a problem closing a region or a failure to
update .META. w/ 'offline' per region, would bork the disabling process.
It was super fragile.  We tried to say as much at the time.

In 0.90, the client queues an executor in the master that flips the
table to 'disabling' in zk and then, in parallel, sends out unassigns of
all the table's regions.  The executor then hangs around with a more
DDL-like timeout of hbase.bulk.assignment.waiton.empty.rit (10 minutes
by default).  Meantime, clients can check the state of the disable.
After all unassigns complete, the table is flipped to 'disabled'.
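
If ten minutes is not enough for your table sizes, that timeout should
be settable in the master's hbase-site.xml.  A sketch only, assuming the
value is in milliseconds like most HBase timeouts (30 minutes shown):

  <property>
    <!-- How long the master-side executor waits on regions-in-transition
         to empty out during a bulk enable/disable (assumed to be ms). -->
    <name>hbase.bulk.assignment.waiton.empty.rit</name>
    <value>1800000</value>
  </property>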


> 4) Through all of the attempts to disable I saw regions coming and going and
> nothing was consistent. The UI showed the table as disabled and listed 1
> region in the table (there were 1000s). The node view listed several other
> regions but not the same one as the table view. It was a very strange
> situation. The UI to browse the tables and regions is great but it would be
> even better if it gave a 100% view of regions and their current states. A
> summary view of region counts per table based on state or status would be
> fantastic.

Please file a JIRA.  Sounds like a good idea.  We could hoist stuff up
out of the hbck tool into the UI.


> There is a compaction count, but what about in split, read/write
> lock, disabled, etc.? What is the precise list of region states that
> could occur? Show a summary count per state as well as the detailed
> state for each specific region in the list. Fundamentally this is the
> health monitor of the system, and as a dba I really need to know the
> 100% count of regions and where they all are in terms of availability.
> Are they disabled, blocked for writes, blocked for reads, in compaction,
> etc.? If there are various states that cause disabling to be blocked, it
> could be reported here so that I at least know when a disable command
> can be executed successfully (and this should be documented).
>


Please file a JIRA.  This is great stuff.

Sorry for the pain caused by the broken enable/disable.  It should be
better in 0.90 and easier to fix if bugs turn up.

St.Ack


> Thanks
>
> On Thu, Jan 20, 2011 at 9:01 PM, Wayne <[email protected]> wrote:
>
>> I need to delete some tables and I am not sure of the best way to do it.
>> The shell does not work. The disable command says it runs ok, but every
>> time I run drop or truncate I get an exception that says the table is
>> not disabled. The UI shows it as disabled, but truncate/drop still do
>> not work. I have even tried to restart the cluster, as sometimes that
>> makes the disable "stick".
>>
>> What is the best way to delete a table manually? My assumption is that
>> with 10k regions in the 3 tables I need to delete, the shell is not
>> going to work. How can I do this without a completely fresh install of
>> everything? How can the data/tables be removed manually without too
>> much pain?
>>
>> Thanks.
>>
>
