[ 
https://issues.apache.org/jira/browse/HBASE-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540469#comment-13540469
 ] 

Matteo Bertozzi commented on HBASE-7451:
----------------------------------------

Thanks for taking a look at the code.

Over the last few months that code was refactored and simplified, and the 
three-phase commit no longer exists. Take a look at "HBASE-7212 Globally 
Barriered Procedure Mechanism", which replaces it (review here: 
https://reviews.apache.org/r/8240/). 

The globally consistent snapshot was also put aside, and it will not be in the 
first snapshot cut.

You can find the full branch containing the new procedure mechanism and the new 
online snapshot code here: 
https://github.com/jmhsieh/hbase/commits/snapshots-work
                
> [snapshot] regionserver will deadlock when GlobalSnapshotOperation timeout 
> happens
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-7451
>                 URL: https://issues.apache.org/jira/browse/HBASE-7451
>             Project: HBase
>          Issue Type: Bug
>          Components: snapshots
>            Reporter: terry zhang
>            Assignee: terry zhang
>
> Hi Matteo Bertozzi and Jesse Yates, my observation is based on the code on 
> GitHub: https://github.com/matteobertozzi/hbase/
> If we create a snapshot and hit a regionserver timeout, the RS stays locked 
> and no data can be written to it. Please take a look at the log below:
> // regionserver snapshot timeout
> org.apache.hadoop.hbase.server.commit.distributed.DistributedCommitException: 
> org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException:
>  Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms
> at 
> org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.wrap(DistributedThreePhaseCommitErrorDispatcher.java:135)
> at 
> org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.operationTimeout(DistributedThreePhaseCommitErrorDispatcher.java:71)
> at 
> org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:92)
> at 
> org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:89)
> at 
> org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer$1.run(OperationAttemptTimer.java:71)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
> Caused by: 
> org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException:
>  Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms
> ... 3 more
> 2012-12-26 18:44:57,211 DEBUG 
> org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase.
> 2012-12-26 18:44:57,211 DEBUG 
> org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: 
> Cleanup snapshot - handled in sub-tasks on error
> 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.serv
> // keeps waiting for the 'commit allowed' latch and does not exit
> 2012-12-26 18:44:57,211 DEBUG 
> org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase.
> 2012-12-26 18:44:57,211 DEBUG 
> org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: 
> Cleanup snapshot - handled in sub-tasks on error
> 2012-12-26 18:44:57,212 DEBUG 
> org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running finish phase.
> 2012-12-26 18:44:57,212 DEBUG 
> org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: 
> Finish snapshot - handling in subtasks on error
> 2012-12-26 18:44:57,212 WARN 
> org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer: Timer 
> already marked completed, ignoring!
> 2012-12-26 18:45:01,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:06,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:11,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:16,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:17,002 INFO 
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
>  Received children changed event:/hbase-TERRY-73/online-snapshot/prepare
> 2012-12-26 18:45:17,002 INFO 
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
>  Recieved start event.
> 2012-12-26 18:45:17,002 DEBUG 
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
>  Looking for new operations under 
> znode:/hbase-TERRY-73/online-snapshot/prepare
> 2012-12-26 18:45:17,003 INFO 
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
>  Received children changed event:/hbase-TERRY-73/online-snapshot/abort
> 2012-12-26 18:45:17,003 INFO 
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
>  Recieved abort event.
> 2012-12-26 18:45:17,003 DEBUG 
> org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
>  Checking for aborted operations on node:/hbase-TERRY-73/online-snapshot/abort
> 2012-12-26 18:45:21,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:26,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:31,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:36,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:41,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:43,481 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: 
> LRU Stats: total=11.77 MB, free=1.39 GB, max=1.4 GB, blocks=5, accesses=96, 
> hits=91, hitRatio=94.79%, cachingAccesses=96, cachingHits=91, 
> cachingHitsRatio=94.79%, evictions=0, evicted=0, evictedPerRun=NaN
> 2012-12-26 18:45:46,993 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:51,993 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:45:56,993 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:01,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:06,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:11,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:16,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:21,995 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:26,995 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:31,995 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:36,996 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
> 2012-12-26 18:46:41,996 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
> for 'commit allowed' latch. (sleep:5000 ms)
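
The log above shows a thread that keeps polling the 'commit allowed' latch 
after the timeout error has already been dispatched. As a rough illustration 
of how such a wait can be made abortable, here is a minimal sketch in plain 
Java. It is not the HBase code from either branch: the class 
AbortableLatchWait, the ErrorMonitor holder and the waitForLatch helper are 
hypothetical names invented for this example, and the only library calls used 
are from java.util.concurrent.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

public class AbortableLatchWait {

    /** Hypothetical error holder; a timer or abort handler records the first failure here. */
    static final class ErrorMonitor {
        private final AtomicReference<Exception> error = new AtomicReference<>();

        void receiveError(Exception e) {
            error.compareAndSet(null, e); // keep only the first error
        }

        Exception getError() {
            return error.get();
        }
    }

    /**
     * Waits for the latch, waking every wakeMillis to re-check the monitor.
     * Returns once the latch is released; rethrows the recorded error otherwise.
     */
    static void waitForLatch(CountDownLatch latch, ErrorMonitor monitor,
                             long wakeMillis, String name) throws Exception {
        while (true) {
            Exception e = monitor.getError();
            if (e != null) {
                throw e; // abort instead of polling forever
            }
            if (latch.await(wakeMillis, TimeUnit.MILLISECONDS)) {
                return; // latch released normally
            }
            System.out.println("Waiting for '" + name + "' latch. (sleep:" + wakeMillis + " ms)");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch commitAllowed = new CountDownLatch(1); // never counted down in this demo
        ErrorMonitor monitor = new ErrorMonitor();

        // Simulate the operation timer firing while the worker is still waiting.
        Thread timer = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
            monitor.receiveError(new TimeoutException("Timeout elapsed!"));
        });
        timer.start();

        try {
            waitForLatch(commitAllowed, monitor, 20, "commit allowed");
        } catch (Exception e) {
            System.out.println("Wait aborted: " + e.getMessage());
        }
        timer.join();
    }
}
```

The point of the sketch is only the loop shape: the error check sits inside 
the same loop as the timed await, so a waiter notices a timeout or abort 
within one wake-up interval even if nobody ever counts the latch down.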
