[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted
[ https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361705#comment-16361705 ]

Duo Zhang commented on HBASE-19976:
-----------------------------------

I think yield is not an easy way for the developers; the retry is in the HTable implementation...

And as I said above, the ServerCrashProcedure which carries meta should be high priority and it is in the server queue; the RecoverMeta should also be high priority but it is in the table queue...

> Dead lock if the worker threads in procedure executor are exhausted
> -------------------------------------------------------------------
>
>                 Key: HBASE-19976
>                 URL: https://issues.apache.org/jira/browse/HBASE-19976
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Assignee: stack
>            Priority: Critical
>
> See the comments in HBASE-19554. If all the worker threads are stuck in
> AssignProcedure since the meta region is offline, then the RecoverMetaProcedure
> cannot be executed, which causes a dead lock.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted
[ https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361698#comment-16361698 ]

stack commented on HBASE-19976:
-------------------------------

bq. This is a very typical dead lock problem in computer science.

Smile. We see it in many forms. The usual response is a special channel to handle the 'exception'. Then the number of exceptional behaviors builds up, and we either up the number of 'meta' handlers to avoid deadlock in the meta handlers or add a meta-meta handler.

I was wondering if you had a thread dump that showed all handlers occupied. All threads blocked while occupying procedures, so that the meta procedure was unable to run, struck me as an ugly situation. They should yield.

We have dedicated queues -- queues for server tasks, queues for table tasks -- and then within these, notions of priority such that high priority is scheduled more frequently than low priority, and server tasks before table tasks. As long as Procedures yield, it should work out fine?

Do you think we need to add a new priority dimension to the mix [~Apache9]? The RecoverMetaProcedure is made up of multiple steps (log splitting, assign) and subprocedures. Would all of them run in a single high-priority thread?

Thanks.
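As a rough illustration of the yield idea in the comment above, here is a minimal single-worker simulation. It is only a sketch under stated assumptions: `YieldDemo`, `Step`, and `runAll` are hypothetical names, not part of the HBase procedure framework, and the "assign" step simply hands its worker back and is re-queued whenever meta is offline instead of blocking.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical single-worker simulation; not the HBase Procedure API.
class YieldDemo {
  static boolean metaOnline = false;

  interface Step {
    /** Returns true when finished, false to yield and be re-queued. */
    boolean run();
  }

  static List<String> runAll() {
    metaOnline = false;
    List<String> log = new ArrayList<>();
    Deque<Step> queue = new ArrayDeque<>();

    // Stand-in for an assign that needs meta: it yields rather than blocks.
    queue.addLast(() -> {
      if (!metaOnline) {
        log.add("assign yields");
        return false;
      }
      log.add("assign done");
      return true;
    });

    // Stand-in for meta recovery, queued behind the assign.
    queue.addLast(() -> {
      metaOnline = true;
      log.add("meta recovered");
      return true;
    });

    // The single worker: a yielded step goes to the back of the queue,
    // so the recovery step gets its turn even with only one worker.
    while (!queue.isEmpty()) {
      Step step = queue.pollFirst();
      if (!step.run()) {
        queue.addLast(step);
      }
    }
    return log;
  }

  public static void main(String[] args) {
    System.out.println(runAll());
  }
}
```

With a single worker the recovery step still gets scheduled, because the waiting step gives up its turn. If the assign blocked inside `run()` instead, the loop would never reach the recovery step.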
[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted
[ https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361685#comment-16361685 ]

Duo Zhang commented on HBASE-19976:
-----------------------------------

Seems I added the thread dump in the wrong place, so there is no thread dump on failure...

Anyway, see here

https://builds.apache.org/job/HBASE-Flaky-Tests/25832/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestDLSFSHLog-output.txt/*view*/

{noformat}
2018-02-12 04:56:54,563 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-1(pid=139) run time 31.6840sec
2018-02-12 04:56:54,563 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-2(pid=146) run time 29.5870sec
2018-02-12 04:56:54,563 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-3(pid=150) run time 29.5880sec
2018-02-12 04:56:54,563 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-4(pid=142) run time 31.6870sec
2018-02-12 04:56:54,563 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-5(pid=138) run time 31.6830sec
2018-02-12 04:56:54,563 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-6(pid=140) run time 31.6840sec
2018-02-12 04:56:54,564 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-7(pid=141) run time 31.6880sec
2018-02-12 04:56:54,564 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-8(pid=143) run time 31.6890sec
2018-02-12 04:56:54,564 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-9(pid=137) run time 31.6840sec
2018-02-12 04:56:54,564 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-10(pid=136) run time 31.6840sec
2018-02-12 04:56:54,564 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-11(pid=149) run time 29.5880sec
2018-02-12 04:56:54,564 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-12(pid=148) run time 29.5880sec
2018-02-12 04:56:54,564 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-13(pid=144) run time 29.5870sec
2018-02-12 04:56:54,564 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-14(pid=145) run time 29.5870sec
2018-02-12 04:56:54,564 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-15(pid=147) run time 29.5880sec
2018-02-12 04:56:54,564 WARN [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-16(pid=151) run time 29.5890sec
{noformat}

All 16 workers are stuck. Now let's check what each of them is running.

{noformat}
2018-02-12 04:56:22,879 INFO [PEWorker-1] procedure.MasterProcedureScheduler(883): pid=139, ppid=130, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, region=d36808157b0edc272844a07587e3630e testThreeRSAbort testThreeRSAbort,o@\x17\xAB\xCE,1518411364183.d36808157b0edc272844a07587e3630e.
2018-02-12 04:56:24,976 INFO [PEWorker-2] procedure.MasterProcedureScheduler(883): pid=146, ppid=131, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, region=7726f3d31204e2e60fc38582fefddfdb testThreeRSAbort testThreeRSAbort,f\xAA\x08Y),1518411364183.7726f3d31204e2e60fc38582fefddfdb.
2018-02-12 04:56:24,975 INFO [PEWorker-3] procedure.MasterProcedureScheduler(883): pid=150, ppid=131, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, region=693d6cddbd1f127dd087ee20def3f081 testThreeRSAbort testThreeRSAbort,s6\x94\xE5\xA4,1518411364183.693d6cddbd1f127dd087ee20def3f081.
2018-02-12 04:56:22,880 INFO [PEWorker-4] procedure.MasterProcedureScheduler(883): pid=142, ppid=130, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, region=0d0f98adccc5c3430f2981524a9cdd12 testThreeRSAbort testThreeRSAbort,u\xDA\xE8a\x88,1518411364183.0d0f98adccc5c3430f2981524a9cdd12.
2018-02-12 04:56:22,880 INFO [PEWorker-5] procedure.MasterProcedureScheduler(883): pid=138, ppid=130, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, region=15f6ffa81a9f6f469f917050199b8a8c testThreeRSAbort testThreeRSAbort,g\xFC2\x17\x1B,1518411364183.15f6ffa81a9f6f469f917050199b8a8c.
2018-02-12 04:56:22,880 INFO [PEWorker-6] procedure.MasterProcedureScheduler(883): pid=140, ppid=130, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, region=82b1ce3cc98b041a73162d359366df5d testThreeRSAbort testThreeRSAbort,p\x92Ai\xC0,1518411364183.82b1ce3cc98b041a73162d
[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted
[ https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361678#comment-16361678 ]

Duo Zhang commented on HBASE-19976:
-----------------------------------

And here, since the procedure executor is general and lives in another module, it does not know much about the priority we defined in MasterProcedureScheduler. So I plan to abstract the priority like this:

The ProcedureScheduler will return an int telling how many priority levels it has, and in ProcedureExecutor we will reserve one thread for each level except the lowest one. When polling, we pass the priority level to the ProcedureScheduler so that it only fetches the procedures whose priority is higher.

And we can introduce 3 levels in MasterProcedureScheduler: one for meta, one for other system tables, and one for all other procedures. Notice that the ServerCrashProcedure should have a different priority if it carries the meta region or other system regions.

Thanks.
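The plan in this comment could be sketched roughly as follows. This is not the hbase-procedure API: `LeveledScheduler`, `getPriorityLevels`, `poll(int)`, and `maxLevelForWorker` are hypothetical names, procedures are plain strings, and the sketch assumes level 0 is the highest priority and that a worker polling with level N may take anything at level N or higher.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical names throughout; an illustration, not the HBase scheduler.
class LeveledScheduler {
  // Index 0 is the highest priority level, the last index the lowest.
  private final List<Deque<String>> queues = new ArrayList<>();

  LeveledScheduler(int levels) {
    for (int i = 0; i < levels; i++) {
      queues.add(new ArrayDeque<>());
    }
  }

  /** Tells the executor how many priority levels exist. */
  int getPriorityLevels() {
    return queues.size();
  }

  synchronized void add(String procedure, int level) {
    queues.get(level).addLast(procedure);
  }

  /** Returns the first queued procedure at level <= maxLevel, or null if none. */
  synchronized String poll(int maxLevel) {
    for (int level = 0; level <= maxLevel; level++) {
      String procedure = queues.get(level).pollFirst();
      if (procedure != null) {
        return procedure;
      }
    }
    return null;
  }

  /**
   * One reserved worker per level except the lowest: worker i (for
   * i < levels - 1) only polls levels 0..i. All remaining workers are
   * general purpose and poll every level.
   */
  static int maxLevelForWorker(int workerIndex, int levels) {
    return Math.min(workerIndex, levels - 1);
  }
}
```

With three levels (meta, other system tables, everything else), worker 0 would only ever pick up meta procedures and worker 1 only meta or system-table procedures, so neither can be tied up by user-table assigns.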
[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted
[ https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361674#comment-16361674 ]

Duo Zhang commented on HBASE-19976:
-----------------------------------

This is a very typical dead lock problem in computer science. The resources are all held by some processes, so we have no chance to schedule other processes, but the running processes need the result of another process to complete; hence the dead lock.

Here the resource is the thread. One way to solve this is to reserve a thread that only executes high priority procedures.
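The exhaustion pattern described above can be reproduced with plain `java.util.concurrent`, independent of HBase. In this minimal sketch, `WorkerExhaustionDemo` and `completes` are hypothetical names, the latch stands in for "meta is online", and the 500 ms timeout is an arbitrary choice so that the stuck case returns instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustration only: plain thread pools, not the HBase ProcedureExecutor.
class WorkerExhaustionDemo {

  /**
   * Starts `workers` tasks that block until a recovery task releases them.
   * Returns true if every blocked task finished within the timeout.
   */
  static boolean completes(int workers, boolean reserveThread) {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    // With reserveThread, the recovery task gets its own dedicated thread;
    // otherwise it queues behind the blocked tasks in the shared pool.
    ExecutorService recovery = reserveThread ? Executors.newSingleThreadExecutor() : pool;
    CountDownLatch metaOnline = new CountDownLatch(1);
    CountDownLatch finished = new CountDownLatch(workers);
    try {
      for (int i = 0; i < workers; i++) {
        pool.execute(() -> {
          try {
            metaOnline.await(); // stand-in for an assign waiting on meta
            finished.countDown();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      recovery.execute(metaOnline::countDown); // stand-in for meta recovery
      return finished.await(500, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      // Interrupt anything still blocked so the JVM can exit.
      pool.shutdownNow();
      recovery.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("shared pool completes: " + completes(4, false));
    System.out.println("reserved thread completes: " + completes(4, true));
  }
}
```

In the shared-pool case every thread is occupied by a waiting task, so the one task that could release them never gets a thread. Giving it a reserved thread breaks the cycle.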
[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted
[ https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361666#comment-16361666 ]

stack commented on HBASE-19976:
-------------------------------

No, I'm wrong, RecoverMetaProcedure implements TableProcedureInterface.

Do you have a thread dump of all the stuck Procedures waiting on Master [~Apache9]? Can I break up this step so RecoverMetaProcedure has a chance to run?
[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted
[ https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361621#comment-16361621 ]

stack commented on HBASE-19976:
-------------------------------

Ideally we'd schedule RecoverMetaProcedure at the front of the queue.
[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted
[ https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361619#comment-16361619 ]

stack commented on HBASE-19976:
-------------------------------

[~Apache9] RecoverMetaProcedure is relatively new. It is neither a TableProcedure nor a ServerProcedure (oversight?). Could this be the problem? If it were a ServerProcedure, would it be run ahead of everyone?
[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted
[ https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361617#comment-16361617 ]

stack commented on HBASE-19976:
-------------------------------

[~Apache9] HBASE-18109 is about the prioritization we currently have: server procedures > table procedures, and meta > system > user-space tables. Are you seeing that RecoverMetaProcedure is not being scheduled, though in its guts it is about assigning meta?
[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted
[ https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361608#comment-16361608 ]

stack commented on HBASE-19976:
-------------------------------

Issue on procedure priority; mostly about how system tables get assigned ahead of user-space tables. Talks about how high priority procedures are scheduled more frequently than lower priority procedures, and how high priority procedures get scheduled at the front of the queues.
[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted
[ https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361605#comment-16361605 ]

Duo Zhang commented on HBASE-19976:
-----------------------------------

I can give it a shot. I modified the executor and scheduler when implementing procedure based replication, so I think I'm familiar enough.

So do you have other ideas in mind sir? If not, let me try the priority approach first?

Thanks.
[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted
[ https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360984#comment-16360984 ]

stack commented on HBASE-19976:
-------------------------------

bq. We need to introduce something like priority, and expose a method in ProcedureScheduler to poll high priority procedure only.

Let me take a look.

bq. IMHO, the procedure framework is really over design...

Yeah, it gets a good bit of criticism. It was taken to a particular state and then not looked at again. It is as yet unfinished, synchronous when the idea was that it would run async, etc. I could start a doc to accumulate wants and comments.
[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted
[ https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360532#comment-16360532 ]

Duo Zhang commented on HBASE-19976:
-----------------------------------

OK, here the problem is that we can only insert the special logic into MasterProcedureScheduler, and there is no way to have an extra thread that runs meta related procedures only, since ProcedureExecutor is in hbase-procedure...

We need to introduce something like priority, and expose a method in ProcedureScheduler to poll high priority procedure only.

Do you have any other ideas in mind sir? [~stack]

IMHO, the procedure framework is really over design...
[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted
[ https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360473#comment-16360473 ]

Duo Zhang commented on HBASE-19976:
-----------------------------------

It is not easy to fix since the MasterProcedureScheduler is really really complicated...