[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes

2026-03-10 Thread Ivan Andika (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18064984#comment-18064984
 ] 

Ivan Andika edited comment on RATIS-2403 at 3/11/26 4:00 AM:
-

Found a paper that might be useful: 
[https://www.vldb.org/pvldb/vol18/p2831-giortamis.pdf]

The project website is [https://law-theorem.com/]

It introduces Almost-Local Reads (ALRs), which use a form of batching, but it 
appears to batch reads on the follower rather than batching writes on the 
leader (which is what we are pursuing). The paper claims the approach is 
linearizable with a significant performance improvement (2.5x for Raft).

We can also check [https://github.com/vasigavr1/Odyssey], referenced in 
[https://law-theorem.com/artifact.pdf] 

Also attached initial LLM research analysis: [^LAW_THEOREM_RATIS_ANALYSIS.md]



> Improve linearizable follower read throughput instead of writes
> ---
>
> Key: RATIS-2403
> URL: https://issues.apache.org/jira/browse/RATIS-2403
> Project: Ratis
>  Issue Type: Improvement
>  Components: Linearizable Read
>Reporter: Ivan Andika
>Assignee: Ivan Andika
>Priority: Major
> Attachments: 1362_review.patch, LAW_THEOREM_RATIS_ANALYSIS.md, 
> leader-backpressure.patch, leader-batch-write.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> While benchmarking linearizable follower reads, the observation is that the 
> more requests go to the followers instead of the leader, the better the write 
> throughput becomes: we saw around a 2-3x write throughput increase compared to 
> leader-only writes and reads (most likely due to less leader resource 
> contention). However, the read throughput becomes worse than leader-only 
> writes and reads (some cases fall below 0.2x). Even with optimizations such as 
> RATIS-2392, RATIS-2382, [https://github.com/apache/ratis/pull/1334], and RATIS-2379, 
> the read throughput remains worse than leader-only writes (these changes improve 
> write performance rather than read performance).
> I suspect that because write throughput increases, the read index advances at 
> a faster rate, which causes follower linearizable reads to wait longer.
> The target is to improve read throughput to 1.5x - 2x of leader-only 
> writes and reads. Currently, pure reads (no writes) improve read 
> throughput by up to 1.7x, but total follower read throughput is far below this 
> target.
> Currently my ideas are:
>  * Sacrificing writes for reads: can we limit the write QPS so that the read 
> QPS can increase?
>  ** From the benchmark, the read throughput only improves when the write 
> throughput is lower.
>  ** We can try a backpressure mechanism so that writes do not advance so 
> quickly that read throughput suffers.
>  *** Follower gap mechanisms (RATIS-1411), but these might cause the leader to 
> stall if a follower is down for a while (e.g. restarted), which violates the 
> majority availability guarantee. It is also hard to know which value is 
> optimal for different workloads.
> Raising this ticket for ideas. [~szetszwo] [~tanxinyu] 
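The write-QPS-limiting idea above could be prototyped with a simple admission gate in front of write submission. This is a hypothetical sketch; the `WriteAdmission` class and its methods are illustrative, not Ratis APIs:

```java
import java.util.concurrent.Semaphore;

public class WriteAdmission {
    private final Semaphore permits;

    public WriteAdmission(int maxInFlightWrites) {
        // cap on concurrent writes; the cap is the tuning knob discussed above
        this.permits = new Semaphore(maxInFlightWrites);
    }

    // Try to admit a write; on false the caller queues, delays, or rejects it.
    public boolean tryAdmit() {
        return permits.tryAcquire();
    }

    // Release the permit when the write completes.
    public void complete() {
        permits.release();
    }
}
```

The open question raised in the description still applies: the right cap depends on the workload, so a static value would need periodic retuning.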



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes

2026-02-24 Thread Ivan Andika (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060647#comment-18060647
 ] 

Ivan Andika edited comment on RATIS-2403 at 2/24/26 8:40 AM:
-

> Since it just delays the replies, the write throughput should not be 
> degraded. I think the problem is at the client side (i.e. the benchmark) – 
> the client may wait for the previous reply before sending another request. If 
> it is the case, using more clients should be able to keep the write 
> throughput.

Yeah, that makes sense; maybe the write throughput degradation can be reduced. 
However, since in Ozone's case the handler threads handling writes will be 
blocked waiting for RaftServerImpl#submitClientRequestAsync to complete, we 
might also need to scale the handlers.
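An alternative to scaling handler threads would be to avoid blocking them at all, completing the RPC from a callback on the returned future instead of joining it. A hypothetical sketch, not Ozone's actual handler code:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

public class AsyncHandlerSketch {
    // Instead of pinning the handler thread with future.join(),
    // attach a callback that sends the reply when the future completes.
    static <R> void handle(CompletableFuture<R> pending,
                           Consumer<R> sendReply,
                           Consumer<Throwable> sendError) {
        pending.whenComplete((reply, err) -> {
            if (err != null) {
                sendError.accept(err);  // runs on the completing thread
            } else {
                sendReply.accept(reply);
            }
        });
        // the handler thread returns immediately and can serve the next request
    }
}
```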

> BTW, the code should replace the reference but not copying the list. Also, 
> use LinkedList for avoiding ArrayList resizing.

Thanks for the suggestion. The patch was mostly generated by an LLM since it is 
currently a PoC; I will try to refine it further. If you think this can be 
safely pushed upstream, I can raise a PR for it.





[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes

2026-02-23 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060444#comment-18060444
 ] 

Tsz-wo Sze edited comment on RATIS-2403 at 2/23/26 7:16 PM:


[~ivanandika], thanks for trying out the leader-batch-write!

bq. ...  but the write throughput can be degraded to 0.7x (one case degrades to 
0.3x),  ...

Since it just delays the replies, the write throughput should not be degraded.  
I think the problem is at the client side (i.e. the benchmark) -- the client 
may wait for the previous reply before sending another request.  If that is the 
case, using more clients should be able to keep the write throughput.

It does make sense if the latency is degraded.

BTW, the code should replace the reference instead of copying the list.  Also, 
use a LinkedList to avoid ArrayList resizing.
{code}
//LeaderStateImpl
  private final AtomicReference<List<RaftClientReply>> heldReplies = ...
  ...
  private void flushReplies() {
    if (heldReplies.get().isEmpty()) {
      return;
    }
    final List<RaftClientReply> toFlush = heldReplies.getAndSet(new LinkedList<>());
    ...
  }
{code}
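The mail archive stripped the generic type parameters from the snippet above, so here is a standalone, compilable version of the same getAndSet swap pattern, with `String` standing in for the reply type. This is a hypothetical sketch (a single writer is assumed for `hold`; the real code would need its existing synchronization):

```java
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class HeldReplies {
    // Held replies are swapped out atomically; the list itself is never copied.
    private final AtomicReference<List<String>> held =
        new AtomicReference<>(new LinkedList<>());

    public void hold(String reply) {
        held.get().add(reply);  // sketch assumes a single writer thread
    }

    // Swap in a fresh LinkedList and return the old list as the batch to flush.
    public List<String> flush() {
        if (held.get().isEmpty()) {
            return List.of();
        }
        return held.getAndSet(new LinkedList<>());
    }
}
```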







[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes

2026-02-23 Thread Ivan Andika (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060285#comment-18060285
 ] 

Ivan Andika edited comment on RATIS-2403 at 2/23/26 10:00 AM:
--

Thank you [~tanxinyu], and apologies for the late reply. Happy CNY, I was also 
celebrating CNY last week.
{quote}*1. Regarding follower read usage in IoTDB*
{quote}
Thanks for the insight regarding the usage of linearizable reads in IoTDB. The 
control plane and data plane setup is similar to Ozone's OM and Datanodes: the 
control plane contains a single Raft group, and the datanodes use a multi-Raft 
setup to distribute load. Currently, we are encountering a bottleneck on the 
control nodes (high RPC queue times on the OM leader, mostly due to lock 
contention, e.g. multiple key operations needing the same bucket lock, which 
blocks caller threads). We intended to enable follower reads to improve the 
read throughput of OM, but it has the unintended effect of reducing read 
throughput while increasing write throughput. For the data plane, similar to 
IoTDB we also use multi-Raft (a single Raft group is called a "Ratis Pipeline") 
to balance load at the data-plane level. Unlike IoTDB, however, reads on the 
data plane do not go through Ratis; instead we use an internal sequence ID / 
version (the Block Commit Sequence ID, BCSID), which corresponds to the Raft 
log index. When a user wants to read a key, the client checks the BCSID stored 
in the meta nodes and compares it with the one in the datanode replica; if they 
mismatch, the datanode replica is stale and the client picks another node. The 
Raft groups in Ozone pipelines are also relatively short-lived: they can be 
destroyed and recreated in response to failures, migration, etc.
{quote}*2. On the Ozone scenario*
{quote}
> Is it because write throughput becomes higher after enabling follower read?

Yes, this is the current behavior I observed.
{quote}*3. On leader throttling approaches*
{quote}
Yes, the maximum gap solution has a lot of issues (as you mentioned) and only 
serves as a way to prove my hypothesis (that there is an inverse relationship 
between write throughput and read throughput). I have attached the approach 
suggested by [~szetszwo]: [^leader-batch-write.patch]. Although it performs 
better than the leader throttling approaches, it is also inflexible, since we 
do not know how frequently we need to batch writes to get the desired read and 
write performance, and the interval cannot adjust dynamically to the current 
workload (read-heavy or write-heavy). I found that a 10ms interval improved 
read throughput by 1.3-1.7x, but write throughput could degrade to 0.7x (one 
case degraded to 0.3x), although there was one case where both read and write 
throughput improved by 1.x to 1.4x. When I decreased the interval to 5ms, the 
read throughput degraded again for write-heavy cases (although in one case both 
read and write throughput improved by 1.2x-1.3x). Therefore, these solutions 
require periodic and careful tuning, which is not ideal and might not be a good 
fit for Ratis upstream.
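The fixed flush interval discussed above (10ms vs 5ms) could be expressed as a scheduled flusher whose period is the tuning knob. A hypothetical sketch; `BatchFlusher` is illustrative, not a Ratis class:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BatchFlusher {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    // Run the flush task every intervalMillis; this interval is the knob
    // whose tuning (10ms vs 5ms) is discussed above.
    public void start(Runnable flush, long intervalMillis) {
        scheduler.scheduleAtFixedRate(flush, intervalMillis, intervalMillis,
            TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scheduler.shutdown();
    }
}
```

A workload-adaptive variant would adjust the interval from observed read/write ratios instead of fixing it, which is exactly the flexibility the comment above finds missing.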

I have had some further thoughts on this issue and identified problems in my 
benchmark setup. I use 100 clients in both the leader-read setup (baseline) and 
the follower-read setup (clients can read from both leader and followers). The 
realization is that 100-client follower reads will never beat 100-client leader 
reads unless the leader is already saturated (high resource usage, or RPC queue 
latency higher than the ReadIndex wait), since the leader responds almost 
immediately while a follower must wait (even 1-2ms is a lot higher than an 
immediate return). A better setup for follower reads might run two concurrent 
configurations: 1) 100 clients for leader reads, and 2) 200 clients for 
follower reads, with 100 clients assigned to each follower (assuming 2 
followers and 1 leader) so that each node serves 100 clients. If we can verify 
that the leader is not degraded when we add the additional 200 clients 
exclusively to the followers, this means follower reads work for adding new 
read throughput. That said, it also means we cannot simply enable follower 
reads on an existing workload, since the existing workload's read throughput 
might degrade. I will benchmark and validate this.

Another problem is that the benchmark only uses a single client (user), which 
should make the Ozone / Hadoop rate limiting ineffective. Ozone / Hadoop rate 
limiting deprioritizes users (identified by username) into lower-priority 
queues if they send a lot of requests (weighted by things like how long a 
request holds exclusive/write or shared/read locks), but if there is only one 
user, this QoS will not kick in. Therefore, I will try to change my benchmark 
to use different users to see whether th

[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes

2026-02-13 Thread Ivan Andika (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058105#comment-18058105
 ] 

Ivan Andika edited comment on RATIS-2403 at 2/13/26 10:01 AM:
--

[~tanxinyu] Thanks for the feedback and ideas. 

FYI, my current benchmark setup:
 * Set up the baseline (leader-only reads and writes)
 * Each benchmark is set up with the following write/read workloads, ranging 
from write-only to read-only:
 ** 100% Write
 ** 100% Read
 ** 10% Write, 90% Read
 ** 30% Write, 70% Read
 ** 90% Write, 10% Read
 * There are 100 client threads, with the following configurations:
 ** Random: each client thread picks a random node (can be leader or follower)
 ** Follower only: each client thread picks only followers

Regarding high pressure or saturation: currently Ozone Manager is not able to 
hit the physical resource limits (CPU, I/O, network), since it is protected by 
backpressure mechanisms such as the RPC queue and RPC handlers, and by 
synchronization mechanisms such as the key lock. However, when enabling 
follower reads, even though there are separate RPC queues and RPC handlers and 
less lock contention (since OM nodes do not share locks), the read throughput 
suffers quite a bit. However, when I throttled the write requests, the read 
requests improved dramatically. I understand that if we push the leader to its 
limit, offloading any additional load to the followers should let the overall 
Raft group handle more throughput. Nonetheless, we also want to ensure that 
there is no throughput degradation in the normal case.

Regarding rate limiting: Ozone currently follows the Hadoop FairCallQueue 
implementation 
([https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FairCallQueue.html]),
where each user's requests are weighted based on things like how long they hold 
locks, etc. The user is then deprioritized into a lower-priority queue (so 
that, for example, for every 1 request served from lower-priority queue 2, 2 
requests are served from higher-priority queue 1). I tried rate-limiting writes 
and it does yield a good improvement in read results (while writes now 
degrade), but the issue is that my current rate limiting is not flexible and 
might regress if the workload changes (e.g. more writes).
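The 2:1 serving policy in the example above can be sketched as a weighted round-robin over two queues. This is a hypothetical illustration of the scheduling idea, not the actual FairCallQueue implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class WeightedQueues {
    private final Deque<String> high = new ArrayDeque<>(); // priority queue 1
    private final Deque<String> low = new ArrayDeque<>();  // priority queue 2
    private int servedFromHigh = 0;

    public void enqueueHigh(String request) { high.add(request); }
    public void enqueueLow(String request) { low.add(request); }

    // Serve 2 requests from the high-priority queue for every 1 from the
    // low-priority queue, matching the 2:1 example in the comment above.
    public String take() {
        if (servedFromHigh < 2 && !high.isEmpty()) {
            servedFromHigh++;
            return high.poll();
        }
        servedFromHigh = 0;
        if (!low.isEmpty()) {
            return low.poll();
        }
        return high.poll(); // low queue empty: keep draining high
    }
}
```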

Let me try to replicate your methodology to see if it can uncover other 
bottlenecks.

[~tanxinyu] Btw, can I check whether linearizable follower reads (with/without 
lease) have been widely used in production for IoTDB? If they have, it means 
the implementation is already production-ready and the bottleneck might be on 
the Ozone side. It would be great if you have blogs or links comparing the 
benchmark against leader-only workloads, so we can see the expected speedup 
(currently my target is a 1.5x-2x read throughput increase with no write 
throughput degradation).


[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes

2026-02-11 Thread Xinyu Tan (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058029#comment-18058029
 ] 

Xinyu Tan edited comment on RATIS-2403 at 2/12/26 5:04 AM:
---

[~ivanandika]
I believe the core issue of balancing read/write performance needs to be 
examined from two distinct perspectives:
 * Effectiveness of Follower read: We can define effectiveness through the 
following experimental model:
 ** Baseline: Both read and write loads are low, and the system has no resource 
bottlenecks.
 ** Introducing a Bottleneck: Increase the query load on the Leader by N times 
(number of replicas), forcing the Leader into a bottlenecked state. At this 
point, write throughput will inevitably drop due to resource contention, and 
query throughput will also be capped.
 ** Enabling Follower Read: Distribute these additional query loads evenly 
across all replicas.
 ** Expected Outcome: If write performance returns to its baseline state (no 
longer interfered with by queries) while the total query throughput increases 
significantly without hitting physical limits across the group, it proves that 
Ratis’s Follower Read is truly effective in offloading the Leader and 
decoupling resources.
 * Resource Zero-Sum Game under High Pressure:
In extreme stress-test scenarios, the total read/write capacity of the 
consensus group is capped by physical resources (CPU, I/O, Network).
A clear example is: even without Follower Read, if we pin all reads and writes 
to the Leader until it bottlenecks, any further increase in write load will 
inevitably decrease query throughput. This phenomenon persists even with 
Follower Read enabled, as faster writes force Followers to consume more 
resources for log synchronization and application (Apply), which in turn 
encroaches on query resources.

Based on the above analysis, I suggest the following directions:
 * Introduce Admission Control (Rate Limiting): Simply optimizing the algorithm 
cannot change the total resource limit. To fundamentally address the mutual 
interference between reads and writes, we might need a rate-limiting mechanism 
for writes. This would allow us to explicitly define a resource ceiling for 
writes within this trade-off, leaving guaranteed headroom for query throughput.
 * Enhance Observability and Identify Optimization Opportunities: I recommend 
analyzing disk IOPS, network bandwidth, and CPU flame graphs during future 
stress tests. Quantitative data—such as WAL write latency, gRPC serialization 
overhead, or state machine lock contention—is critical for pinpointing 
bottlenecks. For instance, in a previous optimization where I simply batched 
the put operations for the WAL blocking queue, I managed to save nearly 20% of 
CPU usage for Apache IoTDB. I believe that under current stress-test scenarios, 
we can uncover many more similar optimization opportunities in Ratis by using 
profiling tools.
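The WAL batching optimization mentioned above (batching the put operations from a blocking queue) can be sketched as follows; `WalBatcher` is a hypothetical name, not actual IoTDB or Ratis code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

public class WalBatcher {
    // Instead of taking entries one at a time, block for the first entry and
    // then drain whatever else is already queued in a single call, so one
    // write syscall / one lock acquisition covers the whole batch.
    public static List<String> nextBatch(BlockingQueue<String> queue, int maxBatch)
            throws InterruptedException {
        List<String> batch = new ArrayList<>(maxBatch);
        batch.add(queue.take());            // block until at least one entry
        queue.drainTo(batch, maxBatch - 1); // grab the rest without blocking
        return batch;
    }
}
```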



[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes

2026-02-11 Thread Xinyu Tan (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058029#comment-18058029
 ] 

Xinyu Tan edited comment on RATIS-2403 at 2/12/26 5:04 AM:
---

[~ivanandika]
I believe the core issue of balancing read/write performance needs to be 
examined from two distinct perspectives:
 * Effectiveness of Follower read: We can define effectiveness through the 
following experimental model:
 ** Baseline: Both read and write loads are low, and the system has no resource 
bottlenecks.
 ** Introducing a Bottleneck: Increase the query load on the Leader by N times 
(number of replicas), forcing the Leader into a bottlenecked state. At this 
point, write throughput will inevitably drop due to resource contention, and 
query throughput will also be capped.
 ** Enabling Follower Read: Distribute these additional query loads evenly 
across all replicas.
 ** Expected Outcome: If write performance returns to its baseline state (no 
longer interfered with by queries) while the total query throughput increases 
significantly without hitting physical limits across the group, it proves that 
Ratis’s Follower Read is truly effective in offloading the Leader and 
decoupling resources.
 * Resource Zero-Sum Game under High Pressure:
In extreme stress-test scenarios, the total read/write capacity of the 
consensus group is capped by physical resources (CPU, I/O, Network).
A clear example is: even without Follower Read, if we pin all reads and writes 
to the Leader until it bottlenecks, any further increase in write load will 
inevitably decrease query throughput. This phenomenon persists even with 
Follower Read enabled, as faster writes force Followers to consume more 
resources for log synchronization and application (Apply), which in turn 
encroaches on query resources.

Based on the above analysis, I suggest the following directions:
 * Introduce Admission Control (Rate Limiting): Simply optimizing the algorithm 
cannot change the total resource limit. To fundamentally address the mutual 
interference between reads and writes, we might need a rate-limiting mechanism 
for writes. This would allow us to explicitly define a resource ceiling for 
writes within this trade-off, leaving guaranteed headroom for query throughput.
 * Enhance Observability and Identify Optimization Opportunities: I recommend 
analyzing disk IOPS, network bandwidth, and CPU flame graphs during future 
stress tests. Quantitative data—such as WAL write latency, gRPC serialization 
overhead, or state machine lock contention—is critical for pinpointing 
bottlenecks. For instance, in a previous optimization I simply batched the put 
operations for the WAL blocking queue and saved nearly 20% of CPU usage in 
Apache IoTDB. I believe that under the current stress-test scenarios, we can 
uncover many more similar optimization opportunities in Ratis by using 
profiling tools.
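The admission-control direction above can be sketched as a token bucket in 
front of the leader's write path. This is only an illustration of the idea 
(class and method names are hypothetical, not Ratis API): writes consume 
tokens refilled at a configured rate, and anything beyond the ceiling is 
rejected or queued, leaving guaranteed headroom for reads.

```java
/**
 * Minimal token-bucket admission control for writes. Illustration only;
 * names are hypothetical and this is not part of the Ratis API.
 */
final class WriteAdmissionControl {
  private final long capacity;        // maximum burst size, in writes
  private final double refillPerNano; // tokens regained per nanosecond
  private double tokens;
  private long lastRefill;

  WriteAdmissionControl(long writesPerSecond, long burst) {
    this.capacity = burst;
    this.refillPerNano = writesPerSecond / 1e9;
    this.tokens = burst;
    this.lastRefill = System.nanoTime();
  }

  /** Returns true if the write may proceed; false means reject or queue it. */
  synchronized boolean tryAcquire() {
    long now = System.nanoTime();
    // Refill proportionally to elapsed time, capped at the burst capacity.
    tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
    lastRefill = now;
    if (tokens >= 1.0) {
      tokens -= 1.0;
      return true;
    }
    return false;
  }
}
```

The interesting knob is the ratio between `writesPerSecond` and the group's 
measured total capacity: the gap between them is exactly the resource ceiling 
reserved for queries in the trade-off described above.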




[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes

2026-02-11 Thread Ivan Andika (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057756#comment-18057756
 ] 

Ivan Andika edited comment on RATIS-2403 at 2/11/26 8:23 AM:
-

[~szetszwo] Thanks for the batching idea; that sounds promising. Let me think 
about how to approach it.

FYI, regarding the previous leader backpressure mechanism, I asked an LLM to 
write a prototype that blocks log appends when the follower commit index falls 
behind the leader's, and benchmarked it ([^leader-backpressure.patch]) with 
raft.server.write.follower.gap.ratio.max=1. The result: reads stay within 
1.5x, but writes degrade to 0.2x-0.5x. This validates that slowing down writes 
does speed up reads. However, this solution is not flexible enough and should 
not be deployed.
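For reference, the gap-based check in a prototype like this reduces to a 
comparison of the leader's commit index against the slowest follower's. The 
sketch below uses hypothetical names, and the exact meaning of the configured 
ratio (here, gap relative to an allowed outstanding-entries window) is an 
assumption; the actual logic is in [^leader-backpressure.patch].

```java
/**
 * Sketch of a follower-gap backpressure check: pause new appends when the
 * slowest follower lags the leader's commit index by more than maxGapRatio
 * times an allowed window. Hypothetical names, not the actual patch.
 */
final class FollowerGapBackpressure {
  private final double maxGapRatio; // cf. raft.server.write.follower.gap.ratio.max

  FollowerGapBackpressure(double maxGapRatio) {
    this.maxGapRatio = maxGapRatio;
  }

  /** True if the leader should stop accepting new writes for now. */
  boolean shouldThrottle(long leaderCommitIndex, long minFollowerCommitIndex,
                         long allowedWindow) {
    long gap = leaderCommitIndex - minFollowerCommitIndex;
    return gap > maxGapRatio * allowedWindow;
  }
}
```

The availability concern from RATIS-1411 is visible here: `minFollowerCommitIndex` 
stops advancing when any follower is down, so the check as written would stall 
the leader even though a majority is still healthy.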

The LLM also tried a SingleFlight pattern that makes multiple follower read 
requests share one ReadIndex RPC, so later follower read requests join the 
ongoing ReadIndex RPC. However, this seems invalid: it violates 
linearizability, since the ReadIndex may already be stale for the later 
requests.
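The reason naive SingleFlight is unsafe: a ReadIndex reflects the leader's 
commit index at the moment the RPC was issued, so it only covers requests that 
arrived before that moment. A linearizable batching variant would instead make 
requests that arrive while a round is in flight wait for the *next* round, 
rather than join the current one. A hypothetical sketch of that variant (not 
Ratis code; the RPC itself is abstracted as a supplier of futures):

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/**
 * Batches ReadIndex lookups without violating linearizability: a request
 * arriving after a round started must NOT reuse that round's index (it may
 * predate the request's arrival); it queues for the next round instead.
 */
final class ReadIndexBatcher {
  private final Supplier<CompletableFuture<Long>> readIndexRpc;
  private CompletableFuture<Long> inFlight;   // round already started
  private CompletableFuture<Long> nextRound;  // round not yet started

  ReadIndexBatcher(Supplier<CompletableFuture<Long>> readIndexRpc) {
    this.readIndexRpc = readIndexRpc;
  }

  synchronized CompletableFuture<Long> getReadIndex() {
    if (inFlight == null) {
      return startRound(); // this request's arrival precedes the RPC: safe
    }
    // A round is in flight; its index may be older than this request.
    // All such requests share ONE future for the next round (the batching).
    if (nextRound == null) {
      nextRound = new CompletableFuture<>();
    }
    return nextRound;
  }

  private CompletableFuture<Long> startRound() {
    CompletableFuture<Long> round = readIndexRpc.get();
    inFlight = round;
    round.whenComplete((i, e) -> onRoundDone());
    return round;
  }

  private synchronized void onRoundDone() {
    inFlight = null;
    if (nextRound != null) {
      CompletableFuture<Long> pending = nextRound;
      nextRound = null;
      startRound().whenComplete((i, e) -> {
        if (e == null) pending.complete(i);
        else pending.completeExceptionally(e);
      });
    }
  }
}
```

Under a write-heavy load this caps the ReadIndex RPC rate at roughly one 
in-flight round at a time while still giving every request an index captured 
after its arrival, which is the property the naive join-the-ongoing-RPC 
version loses.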



> Improve linearizable follower read throughput instead of writes
> ---
>
> Key: RATIS-2403
> URL: https://issues.apache.org/jira/browse/RATIS-2403
> Project: Ratis
>  Issue Type: Improvement
>Reporter: Ivan Andika
>Priority: Major
> Attachments: leader-backpressure.patch
>
>
> While benchmarking linearizable follower read, the observation is that the 
> more requests go to the followers instead of the leader, the better write 
> throughput becomes; we saw around a 2-3x write throughput increase compared 
> to leader-only write and read (most likely due to less leader resource 
> contention). However, the read throughput becomes worse than leader-only 
> write and read (some results can be below 0.2x). Even with optimizations 
> such as RATIS-2392, RATIS-2382, [https://github.com/apache/ratis/pull/1334], 
> and RATIS-2379, the read throughput remains worse than leader-only write 
> (they even improve write performance instead of read performance).
> I suspect that because write throughput increases, the read index advances 
> at a faster rate, which causes follower linearizable reads to wait longer.
> The target is to improve read throughput to 1.5x-2x of leader-only writes 
> and reads. Currently, pure reads (no writes) improve read throughput up to 
> 1.7x, but total follower read throughput is way below this target.
> Currently my ideas are:
>  * Sacrificing writes for reads: can we limit the write QPS so that read 
> QPS can increase?
>  ** From the benchmark, the read throughput only improves when write 
> throughput is lower.
>  ** We can try to use a backpressure mechanism so that writes do not advance 
> so quickly that read throughput suffers.
>  *** Follower gap mechanisms (RATIS-1411), but this might cause the leader 
> to stall if a follower is down for a while (e.g. restarted), which violates 
> the majority availability guarantee. It's also hard to know which value is 
> optimal for different workloads.
> Raising this ticket for ideas. [~szetszwo] [~tanxinyu] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

