[jira] [Commented] (IGNITE-17252) Introduce Replica, ReplicaServer(?), ReplicaService and ReplicaListener interfaces

2022-07-27 Thread Vladislav Pyatkov (Jira)


[ https://issues.apache.org/jira/browse/IGNITE-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571937#comment-17571937 ]

Vladislav Pyatkov commented on IGNITE-17252:


Merged to ignite3_tx

> Introduce Replica, ReplicaServer(?), ReplicaService and ReplicaListener 
> interfaces
> --
>
> Key: IGNITE-17252
> URL: https://issues.apache.org/jira/browse/IGNITE-17252
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Alexander Lapin
>Assignee: Vladislav Pyatkov
>Priority: Major
>  Labels: ignite-3, transaction3_rw
>  Time Spent: 12h 10m
>  Remaining Estimate: 0h
>
> h2. General context
> According to the tx design document, a new abstraction called the {*}primary 
> replica{*} is introduced to encapsulate the replication engine (e.g. Raft) 
> from the business logic:
> {code:java}
> A primary replica is a replica which serves a special purpose in the 
> transaction protocol. Only one primary replica can exist at a time. Each 
> replica is identified by a liveness interval (startTs, endTs). All such 
> intervals are disjoint, so a new primary replica's liveness interval can’t 
> overlap with the previous one. Timestamps used for defining the intervals 
> must be comparable with the timestamps assigned to committing transactions. 
> For example, HLC timestamps can be used for this purpose.
> The primary replica is used to execute the CC protocol (so all reads and 
> writes go through it), thus maintaining serializable executions, as described 
> in the next section.
> The simplest implementation would be piggy-backing on the RAFT protocol, 
> tying a primary replica to a RAFT leader. See the leaseholder section of the 
> RAFT paper for details. In this approach, a RAFT leader is identical to a 
> primary replica node. The endTs is constantly extended using RAFT 
> heartbeating.
> A primary replica’s status can be voluntarily transferred to another replica. 
> This is only possible after its liveness interval expires. This can be 
> useful, for example, for RAFT leader balancing. {code}
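> To make the liveness-interval idea concrete, here is a minimal, hypothetical 
> Java sketch of a lease checked against HLC-comparable timestamps; the names 
> (Lease, PrimaryReplicaLease) are illustrative assumptions, not the actual 
> Ignite 3 API:
> {code:java}
> import java.util.concurrent.atomic.AtomicReference;
>
> // Hypothetical lease with a liveness interval [startTs, endTs); intervals of
> // successive primary replicas are disjoint, and timestamps are comparable
> // with transaction commit timestamps (e.g. HLC).
> record Lease(long startTs, long endTs) {
>     boolean covers(long hlcTs) {
>         return hlcTs >= startTs && hlcTs < endTs;
>     }
> }
>
> class PrimaryReplicaLease {
>     private final AtomicReference<Lease> current;
>
>     PrimaryReplicaLease(Lease initial) {
>         current = new AtomicReference<>(initial);
>     }
>
>     // endTs is constantly extended, e.g. piggy-backed on RAFT heartbeats.
>     void extend(long newEndTs) {
>         current.updateAndGet(l -> new Lease(l.startTs(), Math.max(l.endTs(), newEndTs)));
>     }
>
>     // A request carrying an HLC timestamp is served only while the lease covers it.
>     boolean isValidAt(long hlcTs) {
>         return current.get().covers(hlcTs);
>     }
> }
> {code}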
> Besides the obvious lease-based disjoint replication leader detection, the 
> primary replica is also responsible for handling messages, acting as a 
> storage and replication pre- and post-processor. It's up to the replica to 
> do the following (a sketch of such a listener appears after this list):
>  * acquire, release and await locks
>  * propagate requests directly to storage
>  * convert a message to an appropriate replication (Raft) command and 
> propagate it to the replication engine.
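> A hedged sketch of a listener dispatching along those three paths (all types 
> and method names below are illustrative assumptions, not the final 
> interfaces):
> {code:java}
> import java.util.concurrent.CompletableFuture;
>
> // Illustrative request and dependency shapes; names are assumptions.
> interface Request { }
> record ReadRequest(String key) implements Request { }
> record WriteRequest(long txId, String key, byte[] value) implements Request { }
>
> interface Storage { CompletableFuture<byte[]> read(String key); }
> interface LockManager { CompletableFuture<Void> acquire(long txId, String key); }
> interface ReplicationEngine { CompletableFuture<Void> replicate(Object command); }
>
> class ReplicaListenerSketch {
>     private final Storage storage;
>     private final LockManager lockManager;
>     private final ReplicationEngine replicationEngine;
>
>     ReplicaListenerSketch(Storage storage, LockManager lockManager, ReplicationEngine engine) {
>         this.storage = storage;
>         this.lockManager = lockManager;
>         this.replicationEngine = engine;
>     }
>
>     CompletableFuture<?> invoke(Request request) {
>         if (request instanceof ReadRequest read) {
>             // Propagate the request directly to storage.
>             return storage.read(read.key());
>         }
>
>         WriteRequest write = (WriteRequest) request;
>
>         // Acquire and await locks locally, outside the replication engine...
>         return lockManager.acquire(write.txId(), write.key())
>             // ...then convert the message to a replication (Raft) command and replicate it.
>             .thenCompose(v -> replicationEngine.replicate(toCommand(write)));
>     }
>
>     private Object toCommand(WriteRequest write) {
>         return write; // placeholder conversion for the sketch
>     }
> }
> {code}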
> Let's check the following example:
> *As-Is (currently):*
> {code:java}
> // client-side
> InternalTable.upsert()
>     enlistInTx()
>     raftService.run(upsertCommand)
>         raftGroupService.sendWithRetry(ActionRequest.of(upsertCommand))
>             messagingService().invoke(actionRequest)
>
> // server-side
> ActionRequestProcessor.handleRequest(actionRequest)
>     future = JraftServerImpl.DelegatingStateMachine.getListener()
>         .onBeforeApply(request.command()); // Lock management
>     future.handle(actionRequest.command() instanceof WriteCommand
>         ? applyWrite(actionRequest)
>         : applyRead(actionRequest))
> {code}
> Please pay attention to the *onBeforeApply* step. It was introduced in order 
> to manage (acquire) locks, with the subsequent waiting for locks happening 
> *outside* of raft. It is critical not to occupy the linearized in-raft 
> execution with such lengthy operations as waiting for locks to be released.
> It is worth mentioning that this approach has several disadvantages, e.g. the 
> onBeforeApply step is executed before the isLeader() check, so it might 
> acquire a lock on a non-leader node, which is not the expected behavior.
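> To make this disadvantage concrete, here is a simplified sketch of the 
> problematic ordering (hypothetical names, not the actual jraft code):
> {code:java}
> import java.util.concurrent.CompletableFuture;
>
> class AsIsOrderingSketch {
>     interface Listener { CompletableFuture<Void> onBeforeApply(Object command); }
>     interface RaftNode {
>         boolean isLeader();
>         CompletableFuture<Object> apply(Object command);
>     }
>
>     CompletableFuture<Object> handleRequest(Object command, Listener listener, RaftNode node) {
>         // Locks are acquired in onBeforeApply BEFORE the leadership check...
>         return listener.onBeforeApply(command)
>             .thenCompose(v -> {
>                 if (!node.isLeader()) {
>                     // ...so a follower rejects the request here while it may
>                     // already hold locks it should never have taken.
>                     return CompletableFuture.failedFuture(new IllegalStateException("not a leader"));
>                 }
>                 return node.apply(command);
>             });
>     }
> }
> {code}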
> *To-Be (should be implemented):*
> {code:java}
> // client-side
> InternalTable.upsert()
>     enlistInTx()
>     replicaService.invoke(upsertRequest, primary=true)
>
> // server-side
> Replica.handleRequest(actionRequest)
>     if (actionRequest.isPrimaryEvaluationExpected())
>         checkLease(); // Return failure if not valid
>
>     if (actionRequest instanceof WriteRequest) {
>         // validate writeRequest locally
>
>         // acquire all locks !locally!
>         fut = txManager.intentWriteLock(table);
>
>         fut.handle(() ->
>             future.of(async(replicationEngine.replicate(ReplicationCommand.of(writeRequest)))));
>     }
> {code}
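> The new pieces might then look roughly like this minimal sketch (the 
> signatures below are assumptions for discussion, not a committed API):
> {code:java}
> import java.util.concurrent.CompletableFuture;
>
> // Marker for messages sent to a replica instead of raw raft commands.
> interface ReplicaRequest {
>     boolean isPrimaryEvaluationExpected();
> }
>
> // Client-side facade: routes a request to the (primary) replica of a group.
> interface ReplicaService {
>     <R> CompletableFuture<R> invoke(String groupId, ReplicaRequest request);
> }
>
> // Server-side per-group entry point: lease validation plus delegation.
> interface Replica {
>     CompletableFuture<?> handleRequest(ReplicaRequest request);
> }
>
> // Business-logic hook: lock management, storage access, command conversion.
> interface ReplicaListener {
>     CompletableFuture<?> invoke(ReplicaRequest request);
> }
> {code}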
> In other words:
>  * Instead of raftGroupService, replicaService should be used.
>  * ReplicaService uses messages (actionRequests) instead of raft commands.
>  * Within the scope of RW transactions, replicaService always sends requests 
> to the *primary* replica; however, within RO transactions non-primary 
> replicas will also participate in request handling, so I believe we should 
> introduce a common Replica instead of a strict PrimaryReplica.
>  * Replica is aware of

[jira] [Commented] (IGNITE-17252) Introduce Replica, ReplicaServer(?), ReplicaService and ReplicaListener interfaces

2022-07-27 Thread Alexander Lapin (Jira)


[ https://issues.apache.org/jira/browse/IGNITE-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571931#comment-17571931 ]

Alexander Lapin commented on IGNITE-17252:
--

[~v.pyatkov] LGTM to feature branch.

[jira] [Commented] (IGNITE-17252) Introduce Replica, ReplicaServer(?), ReplicaService and ReplicaListener interfaces

2022-07-14 Thread Alexander Lapin (Jira)


[ https://issues.apache.org/jira/browse/IGNITE-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566740#comment-17566740 ]

Alexander Lapin commented on IGNITE-17252:
--

[~v.pyatkov] I've added some comments in PR.
