[ 
https://issues.apache.org/jira/browse/HBASE-20366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430938#comment-16430938
 ] 

stack commented on HBASE-20366:
-------------------------------

After more study, the move procedure's unassign has not finished. It has set 
CLOSED in hbase:meta but the procedure has not yet completed. The move 
procedure is suspended waiting on the unassign to complete so it can move to 
the assign step. It needs the unassign to let go of the region lock before it 
can move on.

This check for RUNNABLE state seems too strict. We should let suspended 
procedures through.
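
To make the idea concrete, here is a minimal sketch of a strict versus a 
relaxed state check. It is not the HBase code: the State enum, the class name 
and both method names are hypothetical stand-ins, and whether a worker should 
run, skip or re-queue a WAITING procedure is left open here.

```java
import java.util.EnumSet;
import java.util.Set;

public class RunnableCheckSketch {
  // Hypothetical stand-in for the procedure states seen in the log;
  // this is NOT the real HBase ProcedureState enum.
  enum State { RUNNABLE, WAITING, SUCCESS, FAILED }

  // Assumption: the relaxed check admits suspended (WAITING) procedures
  // in addition to RUNNABLE ones.
  static final Set<State> ADMITTED = EnumSet.of(State.RUNNABLE, State.WAITING);

  // Strict check: anything other than RUNNABLE throws, which is the kind of
  // failure that takes the worker thread down in the stack trace.
  static void strictCheck(long pid, State state) {
    if (state != State.RUNNABLE) {
      throw new IllegalArgumentException("pid=" + pid + ", state=" + state);
    }
  }

  // Relaxed check: report whether the state is admitted instead of throwing,
  // so the caller can decide what to do without killing the worker.
  static boolean relaxedCheck(State state) {
    return ADMITTED.contains(state);
  }

  public static void main(String[] args) {
    try {
      strictCheck(8304L, State.WAITING);
    } catch (IllegalArgumentException e) {
      System.out.println("strict check rejected: " + e.getMessage());
    }
    System.out.println("relaxed check admits WAITING: "
        + relaxedCheck(State.WAITING));
  }
}
```

Returning a boolean rather than throwing is only one option; the point is 
that a suspended procedure showing up at the worker need not be fatal.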

Let me keep an eye out for this failure type.

> Procedure State != ProcedureState.RUNNABLE; IllegalArgumentException
> --------------------------------------------------------------------
>
>                 Key: HBASE-20366
>                 URL: https://issues.apache.org/jira/browse/HBASE-20366
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: stack
>            Priority: Critical
>
> PE Worker dies and the Region is offlined because the Procedure is not 
> runnable when the executor goes to run it. It looks like this:
> {code}
> 2018-04-07 19:58:50,589 INFO  [PEWorker-5] 
> procedure.MasterProcedureScheduler: pid=8304, 
> state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure 
> hri=IntegrationTestBigLinkedList,p\xC3\x11\xB2,1523155040553.187ee18fb3dd1a7ac1f9f2b667160729.,
>  source=ve0534.halxg.cloudera.com,16020,1523153184521, 
> destination=ve0542.halxg.cloudera.com,16020,1523155964184 checking lock on 
> 187ee18fb3dd1a7ac1f9f2b667160729
> 2018-04-07 19:58:50,589 INFO  [PEWorker-14] 
> procedure.MasterProcedureScheduler: pid=8302, 
> state=RUNNABLE:MOVE_REGION_ASSIGN; MoveRegionProcedure 
> hri=IntegrationTestBigLinkedList,\xEC0\x83\x96*\x86Qsh\xD82\x1E\xAB\x06$\x89,1523151456082.84e97ce42aeb78a2abaf8f17a278b735.,
>  source=ve0534.halxg.cloudera.com,16020,1523153184521, 
> destination=ve0542.halxg.cloudera.com,16020,1523155964184 checking lock on 
> 84e97ce42aeb78a2abaf8f17a278b735
> 2018-04-07 19:58:50,591 WARN  [PEWorker-5] procedure2.ProcedureExecutor: 
> Worker terminating UNNATURALLY null
> java.lang.IllegalArgumentException: pid=8304, 
> state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure 
> hri=IntegrationTestBigLinkedList,p\xC3\x11\xB2,1523155040553.187ee18fb3dd1a7ac1f9f2b667160729.,
>  source=ve0534.halxg.cloudera.com,16020,1523153184521, 
> destination=ve0542.halxg.cloudera.com,16020,1523155964184
>   at 
> org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1430)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1741)
> {code}
> This killed my job because it offlined a region.
> Narrative:
>  * Balancer moves this region....
>  * Move procedure does dispatch to unassign...
>  * Suspiciously, the close comes in unannounced... it's as though it is a 
> close from another procedure...
>  2018-04-07 19:58:24,296 INFO  [PEWorker-9] assignment.RegionStateStore: 
> pid=8305 updating hbase:meta 
> row=IntegrationTestBigLinkedList,p\xC3\x11\xB2,1523155040553.187ee18fb3dd1a7ac1f9f2b667160729.,
>  regionState=CLOSED
>  * Master is killed by monkey.
>  * Recovery. Region is in CLOSED state.
>  * We go to schedule the move region procedure again... Its state must not 
> have been updated on master crash.
>  2018-04-07 19:58:50,589 INFO  [PEWorker-5] 
> procedure.MasterProcedureScheduler: pid=8304, 
> state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure 
> hri=IntegrationTestBigLinkedList,p\xC3\x11\xB2,1523155040553.187ee18fb3dd1a7ac1f9f2b667160729.,
>  source=ve0534.halxg.cloudera.com,16020,1523153184521, 
> destination=ve0542.halxg.cloudera.com,16020,1523155964184 checking lock on 
> 187ee18fb3dd1a7ac1f9f2b667160729
>  * And then we get
>  2018-04-07 19:58:50,591 WARN  [PEWorker-5] procedure2.ProcedureExecutor: 
> Worker terminating UNNATURALLY null
>  java.lang.IllegalArgumentException: pid=8304, 
> state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure 
> hri=IntegrationTestBigLinkedList,p\xC3\x11\xB2,1523155040553.187ee18fb3dd1a7ac1f9f2b667160729.,
>  source=ve0534.halxg.cloudera.com,16020,1523153184521, 
> destination=ve0542.halxg.cloudera.com,16020,1523155964184
>   at 
> org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1430)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1741)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
