[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted

2018-02-12 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361705#comment-16361705
 ] 

Duo Zhang commented on HBASE-19976:
---

I think yielding is not an easy option for developers, since the retry happens
inside the HTable implementation...

And as I said above, a ServerCrashProcedure which carries meta should be high
priority, and it is in the server queue; the RecoverMetaProcedure should also be
high priority, but it is in the table queue...

> Dead lock if the worker threads in procedure executor are exhausted
> ---
>
> Key: HBASE-19976
> URL: https://issues.apache.org/jira/browse/HBASE-19976
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Assignee: stack
>Priority: Critical
>
> See the comments in HBASE-19554. If all the worker threads are stuck in
> AssignProcedure because the meta region is offline, then the
> RecoverMetaProcedure cannot be executed, which causes a dead lock.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted

2018-02-12 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361698#comment-16361698
 ] 

stack commented on HBASE-19976:
---

bq. This is a very typical dead lock problem in computer science.

Smile. We see it in many forms. The usual response is a special channel to
handle the 'exception'. Then the number of exceptional behaviors builds up, and
then we up the number of 'meta' handlers to avoid deadlock in the meta handlers,
or we add a meta-meta handler.

I was wondering if you had a thread dump that showed all handlers occupied. I
was thinking that all threads being blocked by the procedures occupying them, so
that the meta procedure is unable to run, is an ugly situation. They should yield.

We have dedicated queues -- queues for server tasks, queues for table tasks --
and then within these, notions of priority such that high-priority procedures
are scheduled more frequently than low-priority ones, and server tasks before
table tasks. As long as Procedures yield, it should work out fine? Do you think
we need to add a new priority dimension to the mix [~Apache9]? The
RecoverMetaProcedure is made up of multiple steps (log splitting, assign) and
subprocedures. Would all of them run in a single high-priority thread? Thanks.
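
To make the yield idea concrete, here is a toy model. It is not the HBase procedure framework: `Proc`, `Result`, and `Cluster` are hypothetical names invented for this sketch. A step that cannot make progress hands its worker back instead of blocking, so even a single worker eventually reaches the meta recovery:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model only: these types are hypothetical and are not the HBase procedure API. */
class YieldSketch {
  /** Shared state the toy procedures look at. */
  static final class Cluster {
    boolean metaOnline = false;
  }

  /** A step either finishes or asks to be re-queued so the worker is freed. */
  enum Result { DONE, YIELD }

  interface Proc {
    String name();
    Result step(Cluster cluster);
  }

  static Proc assign(String region) {
    return new Proc() {
      public String name() { return "Assign(" + region + ")"; }
      public Result step(Cluster c) {
        // Instead of blocking until meta is online, give the worker back.
        return c.metaOnline ? Result.DONE : Result.YIELD;
      }
    };
  }

  static Proc recoverMeta() {
    return new Proc() {
      public String name() { return "RecoverMeta"; }
      public Result step(Cluster c) {
        c.metaOnline = true;
        return Result.DONE;
      }
    };
  }

  /**
   * Runs the queue with a single worker and returns the completion order.
   * Note: the loop only terminates if something eventually brings meta online.
   */
  static List<String> run(Deque<Proc> queue, Cluster cluster) {
    List<String> finished = new ArrayList<>();
    while (!queue.isEmpty()) {
      Proc p = queue.pollFirst();
      if (p.step(cluster) == Result.DONE) {
        finished.add(p.name());
      } else {
        queue.addLast(p); // yielded: go to the back so others get a turn
      }
    }
    return finished;
  }

  public static void main(String[] args) {
    Deque<Proc> queue = new ArrayDeque<>();
    queue.add(assign("r1"));
    queue.add(assign("r2"));
    queue.add(recoverMeta());
    System.out.println(run(queue, new Cluster()));
  }
}
```

With one worker, the run completes RecoverMeta first and then both assigns. If the assign steps blocked instead of yielding, the worker would never get to RecoverMeta.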





[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted

2018-02-12 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361685#comment-16361685
 ] 

Duo Zhang commented on HBASE-19976:
---

It seems I added the thread dump in the wrong place, so there is no thread dump
from the failure...

Anyway, see here:

https://builds.apache.org/job/HBASE-Flaky-Tests/25832/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestDLSFSHLog-output.txt/*view*/

{noformat}
2018-02-12 04:56:54,563 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-1(pid=139) run time 31.6840sec
2018-02-12 04:56:54,563 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-2(pid=146) run time 29.5870sec
2018-02-12 04:56:54,563 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-3(pid=150) run time 29.5880sec
2018-02-12 04:56:54,563 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-4(pid=142) run time 31.6870sec
2018-02-12 04:56:54,563 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-5(pid=138) run time 31.6830sec
2018-02-12 04:56:54,563 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-6(pid=140) run time 31.6840sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-7(pid=141) run time 31.6880sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-8(pid=143) run time 31.6890sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-9(pid=137) run time 31.6840sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-10(pid=136) run time 31.6840sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-11(pid=149) run time 29.5880sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-12(pid=148) run time 29.5880sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-13(pid=144) run time 29.5870sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-14(pid=145) run time 29.5870sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-15(pid=147) run time 29.5880sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck PEWorker-16(pid=151) run time 29.5890sec
{noformat}

All workers are stuck. Now let's check the procedures they are running.
{noformat}
2018-02-12 04:56:22,879 INFO  [PEWorker-1] procedure.MasterProcedureScheduler(883): pid=139, ppid=130, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, region=d36808157b0edc272844a07587e3630e testThreeRSAbort testThreeRSAbort,o@\x17\xAB\xCE,1518411364183.d36808157b0edc272844a07587e3630e.
2018-02-12 04:56:24,976 INFO  [PEWorker-2] procedure.MasterProcedureScheduler(883): pid=146, ppid=131, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, region=7726f3d31204e2e60fc38582fefddfdb testThreeRSAbort testThreeRSAbort,f\xAA\x08Y),1518411364183.7726f3d31204e2e60fc38582fefddfdb.
2018-02-12 04:56:24,975 INFO  [PEWorker-3] procedure.MasterProcedureScheduler(883): pid=150, ppid=131, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, region=693d6cddbd1f127dd087ee20def3f081 testThreeRSAbort testThreeRSAbort,s6\x94\xE5\xA4,1518411364183.693d6cddbd1f127dd087ee20def3f081.
2018-02-12 04:56:22,880 INFO  [PEWorker-4] procedure.MasterProcedureScheduler(883): pid=142, ppid=130, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, region=0d0f98adccc5c3430f2981524a9cdd12 testThreeRSAbort testThreeRSAbort,u\xDA\xE8a\x88,1518411364183.0d0f98adccc5c3430f2981524a9cdd12.
2018-02-12 04:56:22,880 INFO  [PEWorker-5] procedure.MasterProcedureScheduler(883): pid=138, ppid=130, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, region=15f6ffa81a9f6f469f917050199b8a8c testThreeRSAbort testThreeRSAbort,g\xFC2\x17\x1B,1518411364183.15f6ffa81a9f6f469f917050199b8a8c.
2018-02-12 04:56:22,880 INFO  [PEWorker-6] procedure.MasterProcedureScheduler(883): pid=140, ppid=130, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, region=82b1ce3cc98b041a73162d359366df5d testThreeRSAbort testThreeRSAbort,p\x92Ai\xC0,1518411364183.82b1ce3cc98b041a73162d

[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted

2018-02-12 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361678#comment-16361678
 ] 

Duo Zhang commented on HBASE-19976:
---

And here, since the procedure executor is general-purpose and lives in another
module, it does not know much about the priorities we define in
MasterProcedureScheduler, so I plan to abstract the priority like this:

The ProcedureScheduler will return an int that tells how many priority levels
it has, and in ProcedureExecutor we will reserve one thread for each level
except the lowest one. When polling, we pass the priority level to the
ProcedureScheduler to fetch only the procedures whose priority is higher.

And we can introduce 3 levels in MasterProcedureScheduler: one for meta, one
for other system tables, and one for all other procedures. Note that a
ServerCrashProcedure should have a different priority depending on whether it
carries the meta region or other system regions.

Thanks.
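
The proposal above could look roughly like the following sketch. It is only an illustration under assumptions: `LeveledScheduler`, `getPriorityLevels`, `poll(int)`, and `workerMinLevels` are hypothetical names, not existing HBase methods, and "priority is higher" is read here as "at or above the level passed in".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of the proposal; none of these names exist in HBase as written. */
class PriorityPollSketch {
  /** Level 0 is the lowest priority; higher numbers are more urgent. */
  static final class LeveledScheduler {
    private final List<Deque<String>> queues = new ArrayList<>();

    LeveledScheduler(int levels) {
      for (int i = 0; i < levels; i++) {
        queues.add(new ArrayDeque<>());
      }
    }

    int getPriorityLevels() {
      return queues.size();
    }

    void add(String proc, int level) {
      queues.get(level).addLast(proc);
    }

    /** Returns the most urgent queued procedure whose level is at least minLevel, or null. */
    String poll(int minLevel) {
      for (int level = queues.size() - 1; level >= minLevel; level--) {
        String proc = queues.get(level).pollFirst();
        if (proc != null) {
          return proc;
        }
      }
      return null;
    }
  }

  /**
   * Gives each worker a minimum level: one reserved worker per level above the
   * lowest, and every remaining worker accepts anything (minimum level 0).
   */
  static int[] workerMinLevels(int workers, int levels) {
    if (workers < levels) {
      throw new IllegalArgumentException("need at least one worker per level");
    }
    int[] min = new int[workers];
    for (int i = 0; i < workers; i++) {
      // worker 0 serves only the highest level, worker 1 the top two, and so on
      min[i] = Math.max(levels - 1 - i, 0);
    }
    return min;
  }

  public static void main(String[] args) {
    // Three levels as suggested: 2 = meta, 1 = other system tables, 0 = everything else.
    LeveledScheduler scheduler = new LeveledScheduler(3);
    scheduler.add("Assign(user region)", 0);
    scheduler.add("Assign(system region)", 1);
    scheduler.add("RecoverMeta", 2);

    System.out.println(Arrays.toString(workerMinLevels(5, scheduler.getPriorityLevels())));
    System.out.println(scheduler.poll(2)); // the worker reserved for meta
    System.out.println(scheduler.poll(2)); // nothing else is urgent enough for it
    System.out.println(scheduler.poll(0)); // a general worker takes the next most urgent
  }
}
```

The point of the design is that the reserved workers never pick up low-priority work, so a flood of user-table assigns cannot occupy every thread.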



[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted

2018-02-12 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361674#comment-16361674
 ] 

Duo Zhang commented on HBASE-19976:
---

This is a very typical dead lock problem in computer science. The resources are
all held by some processes, so we have no chance to schedule other processes,
but the running processes need the result of another process to complete, and
so we get a dead lock. Here the resource is a thread.

One way to solve this is to reserve a thread that executes only high priority
procedures.
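
The situation described above can be reproduced outside HBase with a plain `java.util.concurrent` thread pool. This is a minimal sketch, not HBase code: the waiting tasks stand in for AssignProcedure, the latch stands in for meta coming online, and the 500 ms timeout exists only so the demonstration terminates (a real dead lock would wait forever).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Demonstrates worker exhaustion with a fixed pool, and the reserved-thread remedy. */
class StarvationSketch {
  /**
   * Fills a pool of `workers` threads with tasks that wait for "meta", then submits
   * the task that brings meta online. Returns how many waiters saw meta come online
   * before their timeout expired.
   */
  static int run(int workers, boolean reserveThread) {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    ExecutorService reserved = reserveThread ? Executors.newSingleThreadExecutor() : pool;
    CountDownLatch metaOnline = new CountDownLatch(1);
    try {
      // Stand-in for a procedure blocked on meta; true means meta came online in time.
      Callable<Boolean> waiter = () -> metaOnline.await(500, TimeUnit.MILLISECONDS);
      List<Future<Boolean>> waiters = new ArrayList<>();
      for (int i = 0; i < workers; i++) {
        waiters.add(pool.submit(waiter));
      }
      // Stand-in for the recovery. Without a reserved thread it queues behind the waiters.
      Runnable recover = metaOnline::countDown;
      reserved.execute(recover);

      int succeeded = 0;
      for (Future<Boolean> w : waiters) {
        if (w.get()) {
          succeeded++;
        }
      }
      return succeeded;
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdownNow();
      reserved.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("shared pool:     " + run(4, false) + " of 4 waiters completed");
    System.out.println("reserved thread: " + run(4, true) + " of 4 waiters completed");
  }
}
```

In the shared-pool run, every worker is occupied by a waiter, so the recovery task cannot start until the waiters give up. With the reserved thread, the recovery runs immediately and all waiters complete.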



[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted

2018-02-12 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361666#comment-16361666
 ] 

stack commented on HBASE-19976:
---

No, I'm wrong, RecoverMetaProcedure implements TableProcedureInterface. Do you
have a thread dump of all the stuck Procedures waiting on Master [~Apache9]? Can
I break up this step so RecoverMetaProcedure has a chance to run?



[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted

2018-02-12 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361621#comment-16361621
 ] 

stack commented on HBASE-19976:
---

Ideally we'd schedule RecoverMetaProcedure at the front of the queue.



[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted

2018-02-12 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361619#comment-16361619
 ] 

stack commented on HBASE-19976:
---

[~Apache9] RecoverMetaProcedure is relatively new. It is neither a 
TableProcedure nor a ServerProcedure (oversight?). Could this be the problem? 
If it were a ServerProcedure, it would be run ahead of everyone?



[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted

2018-02-12 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361617#comment-16361617
 ] 

stack commented on HBASE-19976:
---

[~Apache9] HBASE-18109 is about the prioritization we currently have and how
it is server procedures > table procedures and meta > system > user-space
tables. Are you seeing that RecoverMetaProcedure is not being scheduled even
though, in its guts, it is about assigning meta?
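
The ordering mentioned above (server procedures before table procedures, and meta before system before user-space tables) can be pictured as a two-key comparison. The sketch below is only an illustration: `QueueType`, `TableClass`, and `Proc` are hypothetical names, and using the table class as the tie-breaker within each queue type is an assumption, not a description of MasterProcedureScheduler.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical model of the ordering; not the actual scheduler implementation. */
class OrderingSketch {
  // Both enums are declared most urgent first, so natural enum order is urgency order.
  enum QueueType { SERVER, TABLE }
  enum TableClass { META, SYSTEM, USER }

  static final class Proc {
    final String name;
    final QueueType queue;
    final TableClass tableClass;

    Proc(String name, QueueType queue, TableClass tableClass) {
      this.name = name;
      this.queue = queue;
      this.tableClass = tableClass;
    }
  }

  /** Server queue first; within a queue type, meta, then system, then user tables. */
  static final Comparator<Proc> URGENCY =
      Comparator.comparing((Proc p) -> p.queue).thenComparing(p -> p.tableClass);

  /** Returns the procedure names in the order this toy scheduler would hand them out. */
  static List<String> scheduleOrder(List<Proc> procs) {
    List<Proc> sorted = new ArrayList<>(procs);
    sorted.sort(URGENCY);
    List<String> names = new ArrayList<>();
    for (Proc p : sorted) {
      names.add(p.name);
    }
    return names;
  }

  public static void main(String[] args) {
    List<Proc> procs = Arrays.asList(
        new Proc("Assign(user region)", QueueType.TABLE, TableClass.USER),
        new Proc("RecoverMeta", QueueType.TABLE, TableClass.META),
        new Proc("ServerCrash(carrying meta)", QueueType.SERVER, TableClass.META),
        new Proc("Assign(system region)", QueueType.TABLE, TableClass.SYSTEM));
    System.out.println(scheduleOrder(procs));
  }
}
```

Note that an ordering like this only decides which queued procedure is handed out next. It does not help once every worker is already occupied by a procedure that blocks, which is the case this issue is about.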



[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted

2018-02-12 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361608#comment-16361608
 ] 

stack commented on HBASE-19976:
---

Issue on procedure priority; mostly about how system tables get assigned ahead
of user-space tables. Talks about how high-priority procedures are scheduled
more frequently than lower-priority procedures and that high-priority
procedures get scheduled at the front of the queues.



[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted

2018-02-12 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361605#comment-16361605
 ] 

Duo Zhang commented on HBASE-19976:
---

I can give it a shot. I modified the executor and scheduler when implementing
procedure-based replication, so I think I’m familiar enough.

So do you have other ideas in mind, sir? If not, let me try the priority
approach first?

Thanks.



[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted

2018-02-12 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360984#comment-16360984
 ] 

stack commented on HBASE-19976:
---

bq. We need to introduce something like priority, and expose a method in 
ProcedureScheduler to poll high priority procedure only.

Let me take a look.

bq. IMHO, the procedure framework is really over design...

Yeah, it gets a good bit of criticism. Was taken to a particular state and then 
not looked at again. It is as yet unfinished, synchronous when the idea was 
that it would run async, etc. I could start a doc. to accumulate wants and 
comments.



[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted

2018-02-12 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360532#comment-16360532
 ] 

Duo Zhang commented on HBASE-19976:
---

OK, here the problem is, we can only insert the special logic into 
MasterProcedureScheduler, and there is no way to have an extra thread to run 
meta related procedures only since ProcedureExecutor is in hbase-procedure...

We need to introduce something like priority, and expose a method in 
ProcedureScheduler to poll high priority procedure only.

Do you have any other ideas in mind sir? [~stack]
IMHO, the procedure framework is really over design...




[jira] [Commented] (HBASE-19976) Dead lock if the worker threads in procedure executor are exhausted

2018-02-12 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360473#comment-16360473
 ] 

Duo Zhang commented on HBASE-19976:
---

It is not easy to fix since the MasterProcedureScheduler is really really 
complicated...
