[jira] [Comment Edited] (HBASE-28271) Infinite waiting on lock acquisition by snapshot can result in unresponsive master

Viraj Jasani (Jira) Tue, 19 Dec 2023 10:20:08 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798696#comment-17798696
 ]


Viraj Jasani edited comment on HBASE-28271 at 12/19/23 6:19 PM:
----------------------------------------------------------------

LockProcedure implementation at a high level:

Just like any other procedure, it first tries to acquire its lock and proceeds 
only once the lock is acquired (here the lock is its own lock implementation, 
i.e. exclusive/shared locks at the table/namespace/region level).

Execution begins only if the lock is acquired, as per the generic logic:
{code:java}
LockState lockState = acquireLock(proc);
switch (lockState) {
  case LOCK_ACQUIRED:
    execProcedure(procStack, proc);
    break;
  case LOCK_YIELD_WAIT:
    LOG.info(lockState + " " + proc);
    scheduler.yield(proc);
    break;
  case LOCK_EVENT_WAIT:
    // Someone will wake us up when the lock is available
    LOG.debug(lockState + " " + proc);
    break;
  default:
    throw new UnsupportedOperationException();
} {code}
For LockProcedure, the latch is accessed only when the procedure is executed. 
This is how snapshot ensures that the table-level lock has already been 
acquired before it moves forward with creating the snapshot.
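
To make the gating concrete, here is a minimal, self-contained sketch of the idea. It is not the HBase code: the class `SketchLockProc`, the field name `lockAcquireLatch`, and the helper methods are illustrative assumptions, and the enum only reuses the state names from the switch above. It shows that a waiter blocked on the latch is released only when the procedure is actually executed, which in turn happens only in the LOCK_ACQUIRED case.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified model of the gating described above; not the HBase classes.
class LatchGateSketch {
  enum LockState { LOCK_ACQUIRED, LOCK_YIELD_WAIT, LOCK_EVENT_WAIT }

  /** Stand-in for a lock procedure: the latch is released only in execute(). */
  static final class SketchLockProc {
    final CountDownLatch lockAcquireLatch = new CountDownLatch(1);

    void execute() {
      lockAcquireLatch.countDown();
    }
  }

  /** Same shape as the generic switch: the procedure runs only on LOCK_ACQUIRED. */
  static void runOnce(LockState state, SketchLockProc proc) {
    switch (state) {
      case LOCK_ACQUIRED:
        proc.execute();
        break;
      case LOCK_YIELD_WAIT:
      case LOCK_EVENT_WAIT:
        // Not executed, so the latch is left untouched and waiters stay blocked.
        break;
      default:
        throw new UnsupportedOperationException();
    }
  }

  /** Returns true if a waiter on the latch was released within timeoutMs. */
  static boolean waiterUnblocked(LockState state, long timeoutMs) {
    SketchLockProc proc = new SketchLockProc();
    runOnce(state, proc);
    try {
      return proc.lockAcquireLatch.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("LOCK_ACQUIRED   -> " + waiterUnblocked(LockState.LOCK_ACQUIRED, 50));
    System.out.println("LOCK_EVENT_WAIT -> " + waiterUnblocked(LockState.LOCK_EVENT_WAIT, 50));
  }
}
```

The sketch uses a timed `await` only so that the demo terminates; a waiter that calls the untimed `await()` in the LOCK_EVENT_WAIT case would block until something else eventually executes the procedure.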


was (Author: vjasani):
LockProcedure implementation at high level:

Just like any procedure, first it tries to acquire lock => lock acquired.

Only if lock is acquired, the execution begins as per the generic logic:
{code:java}
LockState lockState = acquireLock(proc);
switch (lockState) {
  case LOCK_ACQUIRED:
    execProcedure(procStack, proc);
    break;
  case LOCK_YIELD_WAIT:
    LOG.info(lockState + " " + proc);
    scheduler.yield(proc);
    break;
  case LOCK_EVENT_WAIT:
    // Someone will wake us up when the lock is available
    LOG.debug(lockState + " " + proc);
    break;
  default:
    throw new UnsupportedOperationException();
} {code}
For LockProc, only when it is executed, the latch is accessed. This is the way 
snapshot ensures that the lock at the table level is already acquired and it 
can move forward with creating snapshot.

> Infinite waiting on lock acquisition by snapshot can result in unresponsive 
> master
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-28271
>                 URL: https://issues.apache.org/jira/browse/HBASE-28271
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.7
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>         Attachments: image.png
>
>
> When a region is stuck in transition for a significant time, any attempt to 
> take a snapshot of the table keeps a master handler thread waiting forever. 
> As part of creating a snapshot on an enabled or disabled table, 
> LockProcedure is executed in order to get the table-level lock. If any 
> region of the table is in transition, the LockProcedure cannot be executed 
> for the snapshot handler, so the handler waits until the region transition 
> completes and the table-level lock can finally be acquired by the 
> snapshot handler.
> In cases where a region stays in RIT for a considerable time, enough 
> attempts by the client to create snapshots on the table can 
> easily exhaust all handler threads, leading to a potentially unresponsive 
> master. A sample thread dump is attached.
> Proposal: The snapshot handler should not stay stuck forever if it cannot 
> take the table-level lock; it should fail fast.
> !image.png!
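
One way the fail-fast proposal could look, purely as an illustration: replace an unbounded wait on the latch with a bounded one that raises an error on timeout, so the handler thread is returned to the pool. The issue does not say how the fix is implemented, so the class name, method names, exception type, and timeout value below are all assumptions rather than the HBase change.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a bounded wait; not the actual HBase fix.
class FailFastLockWaitSketch {

  /** Waits up to timeoutMs for the latch and throws instead of blocking forever. */
  static void awaitLockOrFail(CountDownLatch lockAcquireLatch, long timeoutMs)
      throws IOException {
    try {
      if (!lockAcquireLatch.await(timeoutMs, TimeUnit.MILLISECONDS)) {
        throw new IOException(
            "Timed out after " + timeoutMs + " ms waiting for the table lock");
      }
    } catch (InterruptedException e) {
      // Preserve the interrupt status for callers higher up the stack.
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted while waiting for the table lock", e);
    }
  }

  /** Demo helper: true if the lock was observed as acquired, false if we failed fast. */
  static boolean tryAwait(CountDownLatch latch, long timeoutMs) {
    try {
      awaitLockOrFail(latch, timeoutMs);
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    CountDownLatch released = new CountDownLatch(1);
    released.countDown(); // simulates the lock procedure having executed
    System.out.println("lock available  -> " + tryAwait(released, 50));

    CountDownLatch stuck = new CountDownLatch(1); // simulates a region stuck in transition
    System.out.println("lock never free -> " + tryAwait(stuck, 50));
  }
}
```

With a bounded wait, a burst of snapshot requests against a table with a region in transition would each fail after the timeout instead of pinning a handler thread indefinitely. Choosing the timeout is a trade-off between tolerating short transitions and freeing handlers quickly.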



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
