[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations

2017-11-16 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256574#comment-16256574
 ] 

stack commented on HBASE-19216:
---

bq. But I would like to make the reportProcedureDone more general. So procedure 
id will always be presented, and also a serialized protobuf message. We can 
encode the peer id in the protobuf message?

Yeah. This makes sense. Finding a procedure by its pid would be best (most 
general), but we don't have a lookup at the moment. Let me check it out. Also, 
suspend is done w/ the ProcedureEvent, which is separate from the Procedure (as 
you say above).

And I like the way you are trying to do a general solution because this 'bus', 
once open, will be flooded w/ all sorts of cluster messaging.

> Use procedure to execute replication peer related operations
> 
>
> Key: HBASE-19216
> URL: https://issues.apache.org/jira/browse/HBASE-19216
> Project: HBase
> Issue Type: Improvement
> Reporter: Duo Zhang
>
> When building the basic framework for HBASE-19064, I found that the 
> enable/disable peer is built upon the watcher of zk.
> The problem with using a watcher is that you do not know the exact time when 
> all RSes in the cluster have applied the change; it is only 'eventually done'. 
> And for synchronous replication, when changing the state of a replication 
> peer, we need to know the exact time as we can only enable read/write after 
> that time. So I think we'd better use procedure to do this. Change the flag 
> on zk, and then execute a procedure on all RSes to reload the flag from zk.
> Another benefit is that, after the change, zk will be mainly used as a 
> storage, so it will be easy to implement another replication peer storage to 
> replace zk so that we can reduce the dependency on zk.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations

2017-11-16 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256394#comment-16256394
 ] 

Duo Zhang commented on HBASE-19216:
---

Yes, we need something like a ReplicationPeers to keep all the peers at master 
side, and also prevent concurrent modification on a given peer.

But I would like to make the reportProcedureDone more general. So procedure id 
will always be presented, and also a serialized protobuf message. We can encode 
the peer id in the protobuf message?
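
A rough sketch of what such a general report path could look like, assuming the 
master keeps a map from procedure id to a callback. All names here 
(ProcedureDoneBusSketch, register, encodePeerId) are made up for illustration 
and are not existing HBase APIs, and a UTF-8 string stands in for the 
serialized protobuf message:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch; not part of HBase.
class ProcedureDoneBusSketch {
  // procedure id -> handler that interprets the opaque payload
  private final Map<Long, Consumer<byte[]>> waiting = new ConcurrentHashMap<>();

  /** Called on the master before the remote call is dispatched. */
  void register(long procId, Consumer<byte[]> onDone) {
    waiting.put(procId, onDone);
  }

  /**
   * Called when a region server reports in. The procedure id is always
   * present; the payload is opaque to this layer. Returns false for an
   * unknown or already-reported procedure id.
   */
  boolean reportProcedureDone(long procId, byte[] payload) {
    Consumer<byte[]> handler = waiting.remove(procId);
    if (handler == null) {
      return false;
    }
    handler.accept(payload);
    return true;
  }

  // Stand-ins for protobuf (de)serialization of a message carrying the peer id.
  static byte[] encodePeerId(String peerId) {
    return peerId.getBytes(StandardCharsets.UTF_8);
  }

  static String decodePeerId(byte[] payload) {
    return new String(payload, StandardCharsets.UTF_8);
  }
}
```

Only the handler knows how to decode the payload, so the same report method 
could serve other procedure types.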

Anyway, it seems that we need to find the stored procedure event if we want to 
wake up a procedure. Got it. Let me try to implement a create peer procedure.

Thanks sir, this helps a lot.



[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations

2017-11-16 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256287#comment-16256287
 ] 

stack commented on HBASE-19216:
---

bq. For a peer change, I think it is idempotent, so we can retry forever if an 
RS fails to report in.

Ok. We just need to stop pinging if the server goes away.

bq. I plan to add a reportProcedureDone method in RegionServerStatusService

Ok. Should do for a few procedure types.

bq. How can I wake up a suspended procedure?

In Assign/Unassign, we have RegionStateNodes that have in them a reference to 
the Procedure that is manipulating the RS and an associated ProcedureEvent.  
Suspend/resume operates on the RSN PE. Before we dispatch an RPC, we do a 
suspend on the RSN PE. When RS has transitioned the Region, it updates master 
by calling reportRegionStateTransition.  Master finds the pertinent RSN using 
RegionInfo as key. We pull out the Procedure and call reportTransition on it. 
After updating state in the Procedure, the last thing done is a wake up call on 
the PE.

We'd have a registry of Peers in Master (ReplicationPeers?), keyed by peerid? 
The Peer in Master would carry the Procedure and PE references.

Something like that.

bq. I need to create one by myself when suspending the procedure and store it 
in the procedure, so I can get it through the procedureId?

When we create a Peer, it would have in it a PE. The PE would not be created 
each time we want to do a suspend because we want to guard against having more 
than one operation going on against a Peer at a time. The key could be 
procedureid but could it be peerid instead?
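
To make the idea concrete, here is a minimal sketch of a registry keyed by 
peerid where each Peer owns one long-lived event slot. PeerRegistrySketch and 
its methods are hypothetical names, not the real ProcedureEvent API; the slot 
simply records which procedure id is currently suspended on the peer:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; not part of HBase.
class PeerRegistrySketch {

  /** One entry per peer; the event slot is created once, with the peer. */
  static final class Peer {
    // Id of the procedure currently suspended on this peer, or null if idle.
    private final AtomicReference<Long> owner = new AtomicReference<>();

    /** Succeeds only if no other operation is in flight for this peer. */
    boolean suspend(long procId) {
      return owner.compareAndSet(null, procId);
    }

    /** Clears the slot and returns the procedure id to wake, or null if idle. */
    Long wake() {
      return owner.getAndSet(null);
    }
  }

  private final Map<String, Peer> peers = new ConcurrentHashMap<>();

  boolean suspend(String peerId, long procId) {
    return peers.computeIfAbsent(peerId, id -> new Peer()).suspend(procId);
  }

  /** Looks up by peerid, as the report from the region server would. */
  Long wake(String peerId) {
    Peer peer = peers.get(peerId);
    return peer == null ? null : peer.wake();
  }
}
```

Because the slot lives in the Peer rather than being created per suspend, a 
second operation against the same peer is refused until the first is woken.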




So, setting peer would work like 




[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations

2017-11-16 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255299#comment-16255299
 ] 

Duo Zhang commented on HBASE-19216:
---

Ping [~stack]. Need help sir.

Thanks.



[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations

2017-11-15 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253478#comment-16253478
 ] 

Duo Zhang commented on HBASE-19216:
---

I plan to add a reportProcedureDone method in RegionServerStatusService. For 
the failing path I think the current framework can work well. We can retry 
forever if the remote procedure call cannot be sent, and finally a 
remoteCallFailed will be triggered and we can give up retrying.

But for the normal path, I have the overall picture but some details are still 
hazy. I plan to add a procedureId in the request, and the RS will report back 
the procedureId when done. We can get a procedure with this procedureId, but 
then I'm a little confused. How can I wake up a suspended procedure? There 
seems to be a ProcedureEvent, but how is it generated, and how can I get it 
when I only have a procedureId? Do I need to create one myself when suspending 
the procedure and store it in the procedure, so I can get it through the 
procedureId?

Help expected... Still a beginner on the procedure v2 framework... Thanks sir 
[~stack].



[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations

2017-11-13 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251013#comment-16251013
 ] 

Duo Zhang commented on HBASE-19216:
---

For a peer change, I think it is idempotent, so we can retry forever if an RS 
fails to report in. We can have a nonce to prevent useless refreshes, but an 
extra refresh will not affect correctness. So here I would say that there is no 
rollback for this procedure.

And for ServerCrashProcedure, we can just skip the refresh on this node as it 
will load the new peer config when restarting. Also for 
ServerNotRunningYetException, maybe we have the same logic for unassign region?
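
As a rough illustration of the retry rule above, assuming the master can ask 
whether a server is still live: keep resending the idempotent refresh until it 
succeeds, and stop once the server is gone. RefreshRetrySketch, Outcome and the 
liveServers set are made-up names for this sketch, not HBase APIs, and a real 
implementation would back off between attempts:

```java
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch; not part of HBase.
class RefreshRetrySketch {
  enum Outcome { REFRESHED, SERVER_GONE }

  /**
   * Retries the refresh for as long as the server is considered live.
   * sendRefresh returns true once the server has confirmed the refresh.
   * Since the refresh is idempotent, a repeated send is harmless.
   */
  static Outcome refreshUntilDone(
      String server, Set<String> liveServers, Predicate<String> sendRefresh) {
    while (liveServers.contains(server)) {
      if (sendRefresh.test(server)) {
        return Outcome.REFRESHED;
      }
      // A real implementation would sleep or reschedule with backoff here.
    }
    // The server crashed or was removed; it reloads the peer config on restart,
    // so there is nothing left to refresh and nothing to roll back.
    return Outcome.SERVER_GONE;
  }
}
```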

{quote}
When do you need this by?
{quote}
The synchronous replication needs this for state transition. The current 
'eventually done' semantic for changing the peer config is not enough. So, the 
earlier the better :)
Anyway, I can do it by myself, but I need to confirm that my approach is 
correct.

Thanks.



[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations

2017-11-13 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250989#comment-16250989
 ] 

stack commented on HBASE-19216:
---

Sounds fine. The steps you describe resemble how open/close region work 
(AssignProcedure and UnassignProcedure) except you are doing a fan out to all 
RegionServers and then they are all to phone home to the Master when done?

What if all RS fail to report in?  Procedures need to complete. Could 'fail' 
and we do 'rollback/cleanup'?

What if a RS dies? ServerCrashProcedure cleans up stuck Assign/Unassigns. Could 
do similar.

Currently, open region (AssignProcedure) schedules remote request. The 
Procedure then suspends itself. The RS RPCs to Master to tell it Region open. 
The Master then 'wakes up' the suspended procedure to proceed (or fail).

When do you need this by?



[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations

2017-11-13 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250952#comment-16250952
 ] 

Duo Zhang commented on HBASE-19216:
---

I would like to execute the procedure like this:

1. Execute an AddPeerProcedure.
2. Return several RefreshPeerConfigProcedures as sub procedures, which are 
remote procedures. The parent procedure is suspended.
3. For each RefreshPeerConfigProcedure, we will schedule a remote request, and 
it will be sent by calling executeProcedure. The procedure will also be 
suspended after scheduling the call.
4. We will have a way to wake up the RefreshPeerConfigProcedure and tell it 
that the refresh is done on the RS (this is the problem here).
5. After all the sub procedures are done, we wake up the AddPeerProcedure.
6. AddPeerProcedure will do some post work (like logging) and finish.

So the decision here is that, executeProcedure should be an asynchronous call, 
and the return value is useless. When the procedure is done at RS side, it will 
call a method to notify the master?
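
The flow above could be modelled roughly like this. It is a toy, 
single-process sketch: RefreshFanOutSketch, AddPeer and Master are 
hypothetical stand-ins rather than the real procedure classes, and the remote 
call is replaced by a comment:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these are real HBase classes.
class RefreshFanOutSketch {

  /** Parent "procedure": finishes only after every child has reported done. */
  static final class AddPeer {
    final String peerId;
    final List<Long> childProcIds = new ArrayList<>();
    private final AtomicInteger pending = new AtomicInteger();
    volatile boolean finished;

    AddPeer(String peerId) {
      this.peerId = peerId;
    }

    private void childDone() {
      if (pending.decrementAndGet() == 0) {
        finished = true; // steps 5 and 6: wake the parent, post work, finish
      }
    }
  }

  static final class Master {
    // child procedure id -> suspended parent
    private final Map<Long, AddPeer> suspendedChildren = new ConcurrentHashMap<>();
    private long nextProcId = 1;

    /** Steps 1 to 3: create one child per region server and suspend. */
    synchronized AddPeer addPeer(String peerId, List<String> regionServers) {
      AddPeer parent = new AddPeer(peerId);
      parent.pending.set(regionServers.size());
      parent.finished = regionServers.isEmpty();
      for (int i = 0; i < regionServers.size(); i++) {
        long procId = nextProcId++;
        parent.childProcIds.add(procId);
        suspendedChildren.put(procId, parent);
        // A real implementation would send the request to regionServers.get(i)
        // asynchronously here and ignore the RPC return value.
      }
      return parent;
    }

    /** Step 4: the region server calls back with the child procedure id. */
    boolean reportProcedureDone(long procId) {
      AddPeer parent = suspendedChildren.remove(procId);
      if (parent == null) {
        return false; // unknown or duplicate report
      }
      parent.childDone();
      return true;
    }
  }
}
```

Removing the child from the map before counting it means a duplicate report 
from a retried call cannot finish the parent early.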

Thanks.





[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations

2017-11-13 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250923#comment-16250923
 ] 

stack commented on HBASE-19216:
---

Pv2 could be good for this. RS would report to the Master when+seqid they'd 
completed a peer operation (like a RegionServer reporting it had opened a 
Region)? Would that work?

Outline the steps you think are involved and then let me give you pointers. 
Our doc is incomplete [1] and it is not straightforward (unfortunately) how 
stuff works. Let me try and save you some head-banging. [~Apache9]

1. 
https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.xp9zndoycwj



[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations

2017-11-13 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250683#comment-16250683
 ] 

Duo Zhang commented on HBASE-19216:
---

Ping [~stack]. I need to know the state of procedure v2 and whether there is 
already a plan to improve it. If not, I will try to fix it myself.

Thanks.



[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations

2017-11-12 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248858#comment-16248858
 ] 

Duo Zhang commented on HBASE-19216:
---

Seems the procedure framework is not done yet? I followed the 
AssignRegionProcedure to write my RefreshPeerConfigProcedure, and then I found 
that, there is no way to get the response for a RemoteProcedure call!

[~stack] Is there an existing issue to track this sir? The synchronous 
replication will depend on the procedure framework very heavily, so I need 
to know the current state... I need to get this done first before implementing 
synchronous replication.

Thanks.



[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations

2017-11-09 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16245639#comment-16245639
 ] 

Duo Zhang commented on HBASE-19216:
---

I need to get familiar with the procedure code. It seems a 
ReloadReplicationPeerConfigProcedure is enough for all replication related 
operations.



[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations

2017-11-08 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243837#comment-16243837
 ] 

Duo Zhang commented on HBASE-19216:
---

I think this is independent of synchronous replication, so I created a separate 
issue for it.
