[jira] [Updated] (HBASE-17653) HBASE-17624 rsgroup synchronizations will (distributed) deadlock

2017-02-16 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-17653:
--
Attachment: HBASE-17653.master.003.patch

> HBASE-17624 rsgroup synchronizations will (distributed) deadlock
> 
>
> Key: HBASE-17653
> URL: https://issues.apache.org/jira/browse/HBASE-17653
> Project: HBase
> Issue Type: Bug
> Components: rsgroup
> Reporter: stack
> Assignee: stack
> Attachments: HBASE-17653.master.001.patch, 
> HBASE-17653.master.002.patch, HBASE-17653.master.003.patch
>
>
> Follow-on from HBASE-17624. HBASE-17624 made it so only one thread at a time
> has access to the rsgroup administrator. In the tail of HBASE-17624, [~toffer]
> describes a scenario under which we may end up in a (distributed) deadlock.
> Let me repeat [~toffer]'s comment...
> {code}
> Both read/write access can't be single threaded. Consider the situation:
> 1. move_rsgroup_servers is called
> 2. while #1 is happening, the rsgroup region is in transition (the rpc thread
> in #1 holds the monitor lock)
> 3. while #2 is happening, meta is in transition.
> The balancer, trying to figure out a plan for the meta region, tries to get
> the monitor lock but can't. The rpc thread won't release the monitor lock
> since the rsgroup region never gets assigned. The rsgroup region never gets
> assigned because it can't update meta with its new state.
> There's a good chance this can be reproduced just by moving both the rsgroup
> and meta regions onto the same RS and calling move_rsgroup_servers on that RS.
> A bunch of different actors will query for group affiliation, so we can't
> have writes block reads.
> 
> In the code prior to this patch, the getter methods that retrieve group
> information (getRSGroup, ofTable, OfServer, etc.) don't require the monitor
> lock, so the deadlock cycle is broken.
> 
> The methods that do mutations and updates to zk and hbase:rsgroup are
> synchronized appropriately. Point me to where the incoherence is?
> {code}
> This issue is about testing/fixing/restoring rsgroup access. Will be back.
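
The point in the quoted comment (writers may serialize on a monitor, but readers must not wait behind them) can be illustrated with a minimal Java sketch. This is not the HBase rsgroup code and not the content of the attached patches; the names GroupRegistry, groupOfServer, moveServer and the persist callback are hypothetical, and the copy-on-write snapshot is just one possible way to keep reads off the monitor.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for an rsgroup registry; not an HBase class. */
class GroupRegistry {
  // Immutable snapshot of server -> group, replaced wholesale on each write.
  private volatile Map<String, String> serverToGroup = Collections.emptyMap();

  /** Read path: takes no monitor, so it never waits behind a stalled writer. */
  String groupOfServer(String server) {
    return serverToGroup.getOrDefault(server, "default");
  }

  /** Write path: serialized on the monitor; persist may block for a long time. */
  synchronized void moveServer(String server, String group, Runnable persist) {
    Map<String, String> copy = new HashMap<>(serverToGroup);
    copy.put(server, group);
    persist.run(); // stands in for the zk / hbase:rsgroup update (assumption)
    serverToGroup = Collections.unmodifiableMap(copy); // publish new snapshot
  }
}

class RsGroupLockSketch {
  /** Returns true if a read finishes while a writer is stuck holding the monitor. */
  static boolean readerCompletesWhileWriterStalled() {
    GroupRegistry registry = new GroupRegistry();
    CountDownLatch writerInside = new CountDownLatch(1);
    CountDownLatch releaseWriter = new CountDownLatch(1);
    CountDownLatch readerDone = new CountDownLatch(1);

    // Writer enters the synchronized method and stalls inside persist,
    // like the rpc thread waiting on a region that never gets assigned.
    Thread writer = new Thread(() -> registry.moveServer("rs1", "groupA", () -> {
      writerInside.countDown();
      try {
        releaseWriter.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }));
    // Reader plays the balancer asking for group affiliation.
    Thread reader = new Thread(() -> {
      registry.groupOfServer("rs1");
      readerDone.countDown();
    });

    try {
      writer.start();
      writerInside.await(); // the writer now holds the monitor
      reader.start();
      boolean completed = readerDone.await(2, TimeUnit.SECONDS);
      releaseWriter.countDown(); // let the writer finish
      writer.join();
      reader.join();
      return completed;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("reader completed while writer stalled: "
        + readerCompletesWhileWriterStalled());
  }
}
```

If groupOfServer were declared synchronized as well, the reader in this sketch would wait until the writer left the monitor, which is the blocking step in the cycle described above.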



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HBASE-17653) HBASE-17624 rsgroup synchronizations will (distributed) deadlock

2017-02-16 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-17653:
--
Attachment: HBASE-17653.master.002.patch



[jira] [Updated] (HBASE-17653) HBASE-17624 rsgroup synchronizations will (distributed) deadlock

2017-02-16 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-17653:
--
Attachment: HBASE-17653.master.001.patch


[jira] [Updated] (HBASE-17653) HBASE-17624 rsgroup synchronizations will (distributed) deadlock

2017-02-15 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-17653:
--
Description: 
Follow-on from HBASE-17624. HBASE-17624 made it so only one thread at a time
has access to the rsgroup administrator. In the tail of HBASE-17624, [~toffer]
describes a scenario under which we may end up in a (distributed) deadlock.
Let me repeat [~toffer]'s comment...

{code}
Both read/write access can't be single threaded. Consider the situation:
1. move_rsgroup_servers is called
2. while #1 is happening, the rsgroup region is in transition (the rpc thread
in #1 holds the monitor lock)
3. while #2 is happening, meta is in transition.
The balancer, trying to figure out a plan for the meta region, tries to get
the monitor lock but can't. The rpc thread won't release the monitor lock
since the rsgroup region never gets assigned. The rsgroup region never gets
assigned because it can't update meta with its new state.
There's a good chance this can be reproduced just by moving both the rsgroup
and meta regions onto the same RS and calling move_rsgroup_servers on that RS.
A bunch of different actors will query for group affiliation, so we can't
have writes block reads.

In the code prior to this patch, the getter methods that retrieve group
information (getRSGroup, ofTable, OfServer, etc.) don't require the monitor
lock, so the deadlock cycle is broken.

The methods that do mutations and updates to zk and hbase:rsgroup are
synchronized appropriately. Point me to where the incoherence is?
{code}

This issue is about testing/fixing/restoring rsgroup access. Will be back.

  was:Follow-on from HBASE-17624. HBASE-17624 made it so one thread access to 
the rsgroup administrator. In tail of HBASE-17624 [~toffer] describes scenario 
under which we  may end up in a deadlock (distributed). This issue is to 
address this problem.

