[jira] [Commented] (IGNITE-17252) Introduce Replica, ReplicaServer(?), ReplicaService and ReplicaListener interfaces
[ https://issues.apache.org/jira/browse/IGNITE-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571937#comment-17571937 ] Vladislav Pyatkov commented on IGNITE-17252:

Merged to ignite3_tx

> Introduce Replica, ReplicaServer(?), ReplicaService and ReplicaListener interfaces
> ----------------------------------------------------------------------------------
>
> Key: IGNITE-17252
> URL: https://issues.apache.org/jira/browse/IGNITE-17252
> Project: Ignite
> Issue Type: Improvement
> Reporter: Alexander Lapin
> Assignee: Vladislav Pyatkov
> Priority: Major
> Labels: ignite-3, transaction3_rw
> Time Spent: 12h 10m
> Remaining Estimate: 0h
>
> h2. General context
>
> According to the tx design document, a new abstraction called the *primary replica* is introduced to encapsulate the replication engine (e.g. Raft) from the business logic:
>
> {code:java}
> A primary replica is a replica which serves a special purpose in the transaction protocol. Only one primary replica can exist at a time. Each replica is identified by a liveness interval (startTs, endTs). All such intervals are disjoint, so a new primary replica's liveness interval can't overlap with the previous one. Timestamps used for defining the intervals must be comparable with the timestamps assigned to committing transactions. For example, HLC timestamps can be used for this purpose.
>
> The primary replica is used to execute the CC protocol (so all reads and writes go through it), thus maintaining serializable executions, as described in the next section.
>
> The simplest implementation would be piggy-backing on the RAFT protocol, tying the primary replica to the RAFT leader. See the leaseholder section of the RAFT paper for details. With this approach, the RAFT leader is identical to the primary replica node. The endTs is constantly extended using RAFT heartbeating.
>
> A primary replica's status can be voluntarily transferred to another replica. This is only possible after its liveness interval expires. This can be useful, for example, for balancing RAFT leaders.
> {code}
>
> Besides lease-based disjoint replication leader detection, the primary replica is also responsible for handling messages, acting as a storage and replication pre- and post-processor. It is up to the replica to:
> * acquire, release and await locks;
> * propagate requests to the storage directly;
> * convert a message to an appropriate replication (Raft) command and propagate it to the replication engine.
>
> Let's check the following example.
>
> *As-Is (currently):*
> {code:java}
> // client-side
> InternalTable.upsert()
>     enlistInTx()
>     raftService.run(upsertCommand)
>         raftGroupService.sendWithRetry(ActionRequest.of(upsertCommand))
>             messagingService().invoke(actionRequest)
>
> // server-side
> ActionRequestProcessor.handleRequest(actionRequest)
>     // Lock management
>     future = JraftServerImpl.DelegatingStateMachine.getListener().onBeforeApply(request.command());
>     future.handle(actionRequest.command() instanceof WriteCommand
>         ? applyWrite(actionRequest)
>         : applyRead(actionRequest))
> {code}
>
> Please pay attention to the *onBeforeApply* step. It was introduced in order to manage (acquire) locks, with further lock awaiting, *outside* of raft. It is critical not to occupy the linearized in-raft execution with such lengthy operations as waiting for locks to be released.
>
> It is worth mentioning that this approach has several disadvantages, e.g. the onBeforeApply step is executed before the isLeader() check, so it might acquire a lock on a non-leader node, which is not the expected behavior.
>
> *To-Be (should be implemented):*
> {code:java}
> // client-side
> InternalTable.upsert()
>     enlistInTx()
>     replicaService.invoke(upsertRequest, primary=true)
>
> // server-side
> Replica.handleRequest(actionRequest)
>     if (actionRequest.isPrimaryEvaluationExpected())
>         checkLease(); // Return failure if not valid
>
>     if (actionRequest instanceof WriteRequest) {
>         // validate writeRequest locally
>
>         // acquire all locks !locally!
>         fut = txManager.intentWriteLock(table);
>
>         fut.handle(() ->
>             replicationEngine.replicate(ReplicationCommand.of(writeRequest)));
>     }
> {code}
>
> In other words:
> * Instead of raftGroupService, replicaService should be used.
> * ReplicaService uses messages (actionRequests) instead of raft commands.
> * Within the scope of RW transactions, replicaService always sends requests to the *primary* replica; however, within RO transactions, non-primary replicas will also participate in request handling, so I believe we should introduce a common Replica instead of a strict PrimaryReplica.
> * Replica is aware of
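The To-Be server-side flow quoted above (lease check, local lock acquisition outside raft, then replication) can be sketched as a self-contained Java toy. All names here (ReplicaSketch, Lease, Request, intentWriteLock, replicate) are hypothetical stand-ins for illustration, not the actual Ignite 3 API:

```java
import java.util.concurrent.CompletableFuture;

// Minimal sketch of the To-Be Replica.handleRequest flow from the ticket.
// Hypothetical types; the real interfaces are what this issue introduces.
public class ReplicaSketch {
    /** A request; primary evaluation means the lease must be validated first. */
    record Request(boolean primaryEvaluationExpected, boolean write, String payload) {}

    /** Liveness interval (startTs, endTs); intervals of successive primaries are disjoint. */
    record Lease(long startTs, long endTs) {
        boolean validAt(long now) { return startTs <= now && now < endTs; }
    }

    private final Lease lease;

    ReplicaSketch(Lease lease) { this.lease = lease; }

    /** Lease check first, then lock acquisition outside the replication engine, then replication. */
    CompletableFuture<String> handleRequest(Request req, long nowTs) {
        if (req.primaryEvaluationExpected() && !lease.validAt(nowTs)) {
            // Return failure if the lease is not valid: this node is not the primary.
            return CompletableFuture.failedFuture(new IllegalStateException("lease expired"));
        }
        if (req.write()) {
            // Acquire the write-intent lock locally, outside raft, so the
            // linearized in-raft execution never waits for locks.
            return intentWriteLock()
                .thenCompose(ignored -> replicate(req.payload()));
        }
        // Reads are served from local storage directly.
        return CompletableFuture.completedFuture("read:" + req.payload());
    }

    private CompletableFuture<Void> intentWriteLock() {
        return CompletableFuture.completedFuture(null); // pretend the lock is immediately free
    }

    private CompletableFuture<String> replicate(String payload) {
        return CompletableFuture.completedFuture("replicated:" + payload);
    }

    public static void main(String[] args) throws Exception {
        ReplicaSketch primary = new ReplicaSketch(new Lease(0, 100));
        // A write inside the lease interval passes the lease check and is replicated.
        System.out.println(primary.handleRequest(new Request(true, true, "k=v"), 50).get());
        // A request after endTs fails the lease check.
        System.out.println(primary.handleRequest(new Request(true, true, "k=v"), 150)
            .handle((r, e) -> e == null ? r : "rejected").get());
    }
}
```

The point of the shape is the ordering: the lease check and the potentially long lock wait both happen before anything is handed to the replication engine, which is exactly what the onBeforeApply workaround in the As-Is flow was approximating.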
[jira] [Commented] (IGNITE-17252) Introduce Replica, ReplicaServer(?), ReplicaService and ReplicaListener interfaces
[ https://issues.apache.org/jira/browse/IGNITE-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571931#comment-17571931 ] Alexander Lapin commented on IGNITE-17252:

[~v.pyatkov] LGTM to feature branch.
[jira] [Commented] (IGNITE-17252) Introduce Replica, ReplicaServer(?), ReplicaService and ReplicaListener interfaces
[ https://issues.apache.org/jira/browse/IGNITE-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566740#comment-17566740 ] Alexander Lapin commented on IGNITE-17252:

[~v.pyatkov] I've added some comments in PR.