[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations
[ https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256574#comment-16256574 ]

stack commented on HBASE-19216:
-------------------------------

bq. But I would like to make the reportProcedureDone more general. So procedure id will always be presented, and also a serialized protobuf message. We can encode the peer id in the protobuf message?

Yeah. This makes sense. Finding a procedure by its pid would be best (most general), but we don't have a lookup at the moment. Let me check it out. The suspend is then done with the ProcedureEvent, which is separate from the Procedure (as you say above). And I like that you are going for a general solution, because this 'bus', once open, will be flooded with all sorts of cluster messaging.

> Use procedure to execute replication peer related operations
> ------------------------------------------------------------
>
>                 Key: HBASE-19216
>                 URL: https://issues.apache.org/jira/browse/HBASE-19216
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Duo Zhang
>
> When building the basic framework for HBASE-19064, I found that enabling/disabling a peer is built upon a zk watcher.
> The problem with using a watcher is that you do not know the exact time when all RSes in the cluster have applied the change; it is only 'eventually done'.
> For synchronous replication, when changing the state of a replication peer, we need to know the exact time, as we can only enable read/write after that time. So I think we'd better use a procedure to do this: change the flag on zk, and then execute a procedure on all RSes to reload the flag from zk.
> Another benefit is that, after the change, zk will mainly be used as storage, so it will be easy to implement another replication peer storage to replace zk and reduce our dependency on it.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations
[ https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256394#comment-16256394 ]

Duo Zhang commented on HBASE-19216:
-----------------------------------

Yes, we need something like a ReplicationPeers to keep all the peers on the master side, and also to prevent concurrent modification of a given peer.

But I would like to make reportProcedureDone more general. So the procedure id will always be present, along with a serialized protobuf message. We can encode the peer id in the protobuf message? Anyway, it seems that we need to find the stored procedure event if we want to wake up a procedure.

Got it. Let me try to implement a create peer procedure.

Thanks sir, that helps a lot.
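To make the "general reportProcedureDone" idea concrete, here is a minimal sketch in plain Java. It is only an illustration under stated assumptions: `ProcedureDoneReport` and `PendingProcedures` are hypothetical names, a `CompletableFuture` stands in for the stored procedure event, and nothing here is the actual procedure v2 API. The report always carries the procedure id plus an opaque payload (where a serialized protobuf message with the peer id could go), and the master side looks up the stored event by procedure id to wake the waiter.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical names throughout; this is not the HBase procedure v2 API.
final class ProcedureDoneReport {
  final long procId;    // always present
  final byte[] payload; // opaque bytes, e.g. a serialized protobuf carrying the peer id

  ProcedureDoneReport(long procId, byte[] payload) {
    this.procId = procId;
    this.payload = payload;
  }
}

final class PendingProcedures {
  // Stand-in for "the stored procedure event": one future per suspended procedure.
  private final Map<Long, CompletableFuture<byte[]>> pending = new ConcurrentHashMap<>();

  /** Called when a procedure suspends itself after dispatching the remote call. */
  CompletableFuture<byte[]> suspend(long procId) {
    return pending.computeIfAbsent(procId, id -> new CompletableFuture<>());
  }

  /** Called when a region server reports back; false for an unknown or already woken id. */
  boolean reportProcedureDone(ProcedureDoneReport report) {
    CompletableFuture<byte[]> event = pending.remove(report.procId);
    return event != null && event.complete(report.payload);
  }
}
```

Because the event is removed on the first report, a duplicate report for the same procedure id is simply rejected rather than waking anything twice.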
[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations
[ https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256287#comment-16256287 ]

stack commented on HBASE-19216:
-------------------------------

bq. For a peer change, I think it is idempotent, so we can retry forever if an RS fails to report in.

Ok. We just need to stop pinging if the server goes away.

bq. I plan to add a reportProcedureDone method in RegionServerStatusService

Ok. That should do for a few procedure types.

bq. How can I wake up a suspended procedure?

In Assign/Unassign, we have RegionStateNodes (RSN) that hold a reference to the Procedure that is manipulating the RS and an associated ProcedureEvent (PE). Suspend/resume operates on the RSN's PE. Before we dispatch an RPC, we do a suspend on the RSN's PE. When the RS has transitioned the Region, it updates the master by calling reportRegionStateTransition. The master finds the pertinent RSN using RegionInfo as the key. We pull out the Procedure and call reportTransition on it. After updating state in the Procedure, the last thing done is a wake-up call on the PE.

We'd have a registry of Peers in the Master (ReplicationPeers?) keyed by peerid? The Peer in the Master would carry the Procedure and PE references. Something like that.

bq. I need to create one by myself when suspending the procedure and store it in the procedure, so I can get it through the procedureId?

When we create a Peer, it would have a PE in it. The PE would not be created each time we want to do a suspend, because we want to guard against having more than one operation going on against a Peer at a time. The key could be procedureid, but could it be peerid instead?

So, setting peer would work like
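The registry-of-peers idea from this comment can be sketched in plain Java as follows. This is a hedged illustration only: `PeerEntry`, `PeerRegistry`, `tryAcquire`, `release` and `reportDone` are hypothetical names, not classes or methods from HBase, and a single atomic owner id stands in for the long-lived ProcedureEvent. The point it shows is that the per-peer entry is created once, is looked up by peer id, and rejects a second operation while one is still in flight.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; none of these classes exist in HBase under these names.
final class PeerEntry {
  static final long NO_OWNER = -1L;

  // Id of the procedure currently operating on this peer, or NO_OWNER.
  private final AtomicLong owner = new AtomicLong(NO_OWNER);

  /** Claim the peer before dispatching the RPC; false if another procedure holds it. */
  boolean tryAcquire(long procId) {
    return owner.compareAndSet(NO_OWNER, procId);
  }

  /** Release the peer once the owning procedure is done; false if procId is not the owner. */
  boolean release(long procId) {
    return owner.compareAndSet(procId, NO_OWNER);
  }
}

final class PeerRegistry {
  private final Map<String, PeerEntry> peers = new ConcurrentHashMap<>();

  /** The entry is created once per peer, not once per suspend. */
  PeerEntry getOrCreate(String peerId) {
    return peers.computeIfAbsent(peerId, id -> new PeerEntry());
  }

  /** Report path: find the peer by id and release it on behalf of its owner. */
  boolean reportDone(String peerId, long procId) {
    PeerEntry entry = peers.get(peerId);
    return entry != null && entry.release(procId);
  }
}
```

Keying by peer id rather than procedure id is what gives the "one operation per peer" guard for free: a second procedure targeting the same peer finds the same entry and fails to acquire it.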
[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations
[ https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255299#comment-16255299 ]

Duo Zhang commented on HBASE-19216:
-----------------------------------

Ping [~stack]. Need help sir. Thanks.
[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations
[ https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253478#comment-16253478 ]

Duo Zhang commented on HBASE-19216:
-----------------------------------

I plan to add a reportProcedureDone method in RegionServerStatusService.

For the failure path, I think the current framework can work well. We can retry forever if the remote procedure call cannot be sent, and eventually a remoteCallFailed will be triggered and we can give up retrying.

For the normal path, I have the full picture, but some details are still hazy. I plan to add a procedureId to the request, and the RS will report back the procedureId when done. We can get a procedure with this procedureId, but then I'm a little confused. How can I wake up a suspended procedure? There seems to be a ProcedureEvent, but how is it generated, and how can I get it when I only have a procedureId? I need to create one by myself when suspending the procedure and store it in the procedure, so I can get it through the procedureId?

Help expected... Still a beginner on the procedure v2 framework...

Thanks sir [~stack].
[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations
[ https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251013#comment-16251013 ]

Duo Zhang commented on HBASE-19216:
-----------------------------------

For a peer change, I think it is idempotent, so we can retry forever if an RS fails to report in. We can have a nonce to prevent useless refreshes, but an extra refresh will not affect correctness. So I would say that there is no rollback for this procedure.

For ServerCrashProcedure, we can just skip the refresh on that node, as it will load the new peer config when restarting. And for ServerNotRunningYetException, maybe we can apply the same logic as for unassign region?

{quote}
When you need this by?
{quote}

Synchronous replication needs this for state transitions. The current 'eventually done' semantics for changing the peer config are not enough. So, the earlier the better :)

Anyway, I can do it myself, but I need to confirm that my approach is correct. Thanks.
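As a rough illustration of the idempotent refresh with a nonce, here is a small sketch in plain Java. Everything in it is an assumption made for illustration: `PeerConfigRefresher` is a hypothetical region-server-side handler, and the `Supplier<String>` merely stands in for reading the peer flag from zk. A retried request with a nonce that was already applied skips the reload, while a request with a new nonce simply reloads again, which is harmless because the reload always reads the current stored value.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical region-server-side handler; names are illustrative only.
final class PeerConfigRefresher {
  private final Set<Long> seenNonces = ConcurrentHashMap.newKeySet();
  private final Supplier<String> storage; // stands in for reading the peer flag from zk
  private volatile String current;
  private int reloads;

  PeerConfigRefresher(Supplier<String> storage) {
    this.storage = storage;
  }

  /** Safe to call any number of times; returns true only if a reload actually ran. */
  synchronized boolean refresh(long nonce) {
    if (!seenNonces.add(nonce)) {
      return false; // a retry of a request that was already applied
    }
    current = storage.get(); // reloading is idempotent: it always reads the latest value
    reloads++;
    return true;
  }

  String current() {
    return current;
  }

  synchronized int reloads() {
    return reloads;
  }
}
```

The nonce set only saves work. Dropping it would change how often the storage is read, not what state the region server ends up in, which is why no rollback step is needed.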
[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations
[ https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250989#comment-16250989 ]

stack commented on HBASE-19216:
-------------------------------

Sounds fine. The steps you describe resemble how open/close region works (AssignProcedure and UnassignProcedure), except you are doing a fan-out to all RegionServers, and then they are all to phone home to the Master when done?

What if all RSes fail to report in? Procedures need to complete. Could it 'fail' and we do 'rollback/cleanup'? What if an RS dies? ServerCrashProcedure cleans up stuck Assigns/Unassigns. We could do similar.

Currently, open region (AssignProcedure) schedules a remote request. The Procedure then suspends itself. The RS RPCs to the Master to tell it the Region is open. The Master then 'wakes up' the suspended procedure to proceed (or fail).

When do you need this by?
[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations
[ https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250952#comment-16250952 ]

Duo Zhang commented on HBASE-19216:
-----------------------------------

I would like to execute the procedure like this:

1. Execute an AddPeerProcedure.
2. Return several RefreshPeerConfigProcedures as sub procedures, which are remote procedures. The parent procedure is suspended.
3. For each RefreshPeerConfigProcedure, we will schedule a remote request, and it will be sent by calling executeProcedure. The procedure will also be suspended after scheduling the call.
4. We will have a way to wake up the RefreshPeerConfigProcedure and tell it that the refresh is done on the RS (this is the problem here).
5. After all the sub procedures are done, we wake up the AddPeerProcedure.
6. AddPeerProcedure will do some post work (like logging) and finish.

So the decision here is that executeProcedure should be an asynchronous call, and its return value is useless. When the procedure is done on the RS side, it will call a method to notify the master?

Thanks.
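The six steps above can be modelled, very roughly, with plain Java futures. This is a sketch under stated assumptions, not the procedure v2 framework: `AddPeerFlow` and `reportRefreshDone` are hypothetical names, each child future stands in for one suspended RefreshPeerConfigProcedure, and the combined future stands in for the suspended parent. It shows the shape of the flow: one child per region server, each woken by an asynchronous report, and the parent waking only once every child is done.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical model of steps 1-6; not the procedure v2 classes.
final class AddPeerFlow {
  // One suspended child per region server (steps 2 and 3).
  private final Map<String, CompletableFuture<Void>> children = new LinkedHashMap<>();
  // The suspended parent, woken only when every child has completed (step 5).
  private final CompletableFuture<Void> parent;

  AddPeerFlow(List<String> regionServers) {
    for (String rs : regionServers) {
      children.put(rs, new CompletableFuture<>());
    }
    parent = CompletableFuture.allOf(children.values().toArray(new CompletableFuture[0]));
  }

  /** Step 4: the region server reports in asynchronously; false for unknown or duplicate reports. */
  boolean reportRefreshDone(String regionServer) {
    CompletableFuture<Void> child = children.get(regionServer);
    return child != null && child.complete(null);
  }

  /** Step 6 may run once this returns true. */
  boolean isFinished() {
    return parent.isDone();
  }
}
```

Note that nothing here depends on a return value from the dispatch itself, which matches the decision that executeProcedure is asynchronous and completion arrives through a separate notification.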
[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations
[ https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250923#comment-16250923 ]

stack commented on HBASE-19216:
-------------------------------

Pv2 could be good for this. RSes would report to the Master when (and at what seqid) they had completed a peer operation, like a RegionServer reporting that it had opened a Region? Would that work?

Outline the steps you think are involved and then let me give you pointers. Our doc is incomplete [1] and, unfortunately, it is not straightforward how stuff works. Let me try and save you some head-banging. [~Apache9]

1. https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.xp9zndoycwj
[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations
[ https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250683#comment-16250683 ]

Duo Zhang commented on HBASE-19216:
-----------------------------------

Ping [~stack]. I need to know the state of procedure v2 and whether there is already a plan to improve it. If not, I will try to fix it myself. Thanks.
[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations
[ https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248858#comment-16248858 ]

Duo Zhang commented on HBASE-19216:
-----------------------------------

Seems the procedure framework is not done yet? I followed the AssignRegionProcedure to write my RefreshPeerConfigProcedure, and then I found that there is no way to get the response for a RemoteProcedure call!

[~stack] Is there an existing issue to track this, sir? Synchronous replication will depend very heavily on the procedure framework, so I need to know the current state... I need to get this done first before implementing synchronous replication.

Thanks.
[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations
[ https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16245639#comment-16245639 ]

Duo Zhang commented on HBASE-19216:
-----------------------------------

I need to get familiar with the procedure code first. It seems a single ReloadReplicationPeerConfigProcedure is enough for all replication related operations.
[jira] [Commented] (HBASE-19216) Use procedure to execute replication peer related operations
[ https://issues.apache.org/jira/browse/HBASE-19216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243837#comment-16243837 ]

Duo Zhang commented on HBASE-19216:
-----------------------------------

I think this is independent of synchronous replication, so I created a separate issue for it.