[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[ https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18064984#comment-18064984 ]

Ivan Andika edited comment on RATIS-2403 at 3/11/26 4:00 AM:
-

Found a paper that might be useful:
[https://www.vldb.org/pvldb/vol18/p2831-giortamis.pdf] (website:
[https://law-theorem.com/]).

It introduces Almost-Local Reads (ALRs), which include some kind of batching
implementation, but it seems to batch reads on the follower instead of
batching writes on the leader (which is what we are pursuing). The paper
claims the approach is linearizable with a significant performance
improvement (2.5x for Raft).

We can also check [https://github.com/vasigavr1/Odyssey], referenced in
[https://law-theorem.com/artifact.pdf].

Also attached an initial LLM research analysis:
[^LAW_THEOREM_RATIS_ANALYSIS.md]

> Improve linearizable follower read throughput instead of writes
> ---
>
> Key: RATIS-2403
> URL: https://issues.apache.org/jira/browse/RATIS-2403
> Project: Ratis
> Issue Type: Improvement
> Components: Linearizable Read
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Attachments: 1362_review.patch, LAW_THEOREM_RATIS_ANALYSIS.md,
> leader-backpressure.patch, leader-batch-write.patch
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> While benchmarking linearizable follower read, we observed that the more
> requests go to the followers instead of the leader, the better the write
> throughput becomes: around a 2-3x write throughput increase compared to
> leader-only write and read (most likely due to less leader resource
> contention). However, the read throughput becomes worse than leader-only
> write and read (in some cases below 0.2x). Even with optimizations such as
> RATIS-2392, RATIS-2382, [https://github.com/apache/ratis/pull/1334], and
> RATIS-2379, the read throughput remains worse than leader-only write (these
> even improve write performance instead of read performance).
> I suspect that because write throughput increases, the read index advances
> at a faster rate, which causes follower linearizable reads to wait longer.
> The target is to improve read throughput to 1.5x - 2x of leader-only writes
> and reads. Currently pure reads (no writes) improve read throughput up to
> 1.7x, but total follower read throughput is far below this target.
> Currently my ideas are:
> * Sacrificing writes for reads: can we limit the write QPS so that read QPS
> can increase?
> ** From the benchmark, the read throughput only improves when write
> throughput is lower.
> ** We can try to use a backpressure mechanism so that writes do not advance
> so quickly that read throughput suffers.
> *** Follower gap mechanisms (RATIS-1411), but this might cause the leader
> to stall if a follower is down for a while (e.g. restarted), which violates
> the majority availability guarantee. It's also hard to know which value is
> optimal for different workloads.
> Raising this ticket for ideas. [~szetszwo] [~tanxinyu]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[ https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060647#comment-18060647 ]

Ivan Andika edited comment on RATIS-2403 at 2/24/26 8:40 AM:
-

> Since it just delays the replies, the write throughput should not be
> degraded. I think the problem is at the client side (i.e. the benchmark) --
> the client may wait for the previous reply before sending another request.
> If it is the case, using more clients should be able to keep the write
> throughput.

Yeah, that makes sense; the write throughput degradation issue can probably
be reduced. However, since in Ozone's case the handler thread handling writes
will be blocked waiting for RaftServerImpl#submitClientRequestAsync to
complete, we might also need to scale the handlers.

> BTW, the code should replace the reference but not copying the list. Also,
> use LinkedList for avoiding ArrayList resizing.

Thanks for the suggestion. The patch is mostly generated by an LLM since it's
currently a PoC; I will try to refine it further. If you think this can be
safely pushed upstream, I can raise a PR for this.
[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060444#comment-18060444
]
Tsz-wo Sze edited comment on RATIS-2403 at 2/23/26 7:16 PM:
[~ivanandika], thanks for trying out the leader-batch-write!
bq. ... but the write throughput can be degraded to 0.7x (one case degrades to
0.3x), ...
Since it just delays the replies, the write throughput should not be degraded.
I think the problem is at the client side (i.e. the benchmark) -- the client
may wait for the previous reply before sending another request. If it is the
case, using more clients should be able to keep the write throughput.
It does make sense if the latency is degraded.
BTW, the code should replace the reference instead of copying the list. Also,
use LinkedList to avoid ArrayList resizing.
{code}
//LeaderStateImpl
private final AtomicReference<List<RaftClientReply>> heldReplies = ...
...
private void flushReplies() {
  if (heldReplies.get().isEmpty()) {
    return;
  }
  final List<RaftClientReply> toFlush = heldReplies.getAndSet(new LinkedList<>());
  ...
}
{code}
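The swap-instead-of-copy pattern above can be illustrated with a small
self-contained sketch. ReplyBatcher and the String reply type are
hypothetical placeholders for illustration; in the actual patch the holder
would live in LeaderStateImpl:

```java
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch: atomically swap the held-replies list with getAndSet
// instead of copying its elements and clearing it.
class ReplyBatcher {
  private final AtomicReference<List<String>> heldReplies =
      new AtomicReference<>(new LinkedList<>());

  void hold(String reply) {
    heldReplies.get().add(reply);
  }

  // Replace the reference with a fresh LinkedList and return the old
  // batch to flush; no element-by-element copy, no ArrayList resizing.
  List<String> flushReplies() {
    if (heldReplies.get().isEmpty()) {
      return Collections.emptyList();
    }
    return heldReplies.getAndSet(new LinkedList<>());
  }
}
```

Note this sketch elides concurrency control: a hold() racing with the swap
could land a reply on the just-retired list, so real code needs additional
coordination around the swap.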
[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060285#comment-18060285
]
Ivan Andika edited comment on RATIS-2403 at 2/23/26 10:00 AM:
--
Thank you [~tanxinyu], and apologies for the late reply. Happy CNY! I was
also celebrating CNY last week.
{quote}*1. Regarding follower read usage in IoTDB*
{quote}
Thanks for the insight regarding the usage of linearizable read in IoTDB. The
control plane and data plane setup is similar to Ozone's OM and Datanodes,
where the control plane contains a single Raft group and the datanodes use a
multi-Raft setup to distribute load. Currently, we are encountering a
bottleneck on the control nodes (some high RPC queue issues on the OM leader,
mostly due to lock contention (e.g. multiple key operations need to take the
same bucket lock), which blocks caller threads). We intend to enable follower
read to try to improve the read throughput of OM, but it has the unintended
effect of reducing read throughput while increasing write throughput. For the
data plane, similar to IoTDB we also use multi-Raft (a single Raft group is
called a "Ratis Pipeline") to balance load at the data-plane level. Unlike
IoTDB, however, reads on the data plane do not go through Ratis; instead we
use an internal sequence ID / version (called the Block Commit Sequence ID,
BCSID), which corresponds to the Raft log index. When a user wants to read a
key, the client checks the BCSID stored in the meta nodes and compares it
with the one in the datanode replica; if they mismatch, the datanode replica
is stale and the client will pick another node. The Raft groups in Ozone
pipelines are also relatively short-lived: they can be destroyed and
recreated in response to failures, migration, etc.
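The staleness check described above boils down to a comparison between the
BCSID recorded on the meta nodes and the one the replica has applied. A
minimal hypothetical sketch (the names are illustrative, not Ozone's actual
API):

```java
// Hypothetical sketch of the BCSID comparison described above; class and
// method names are illustrative, not Ozone's actual API.
final class BcsidCheck {
  // The replica is safe to read only if it has applied at least the BCSID
  // that the meta nodes recorded for the block; otherwise the replica is
  // stale and the client should pick another node.
  static boolean isReplicaFresh(long metaBcsid, long replicaBcsid) {
    return replicaBcsid >= metaBcsid;
  }
}
```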
{quote}*2. On the Ozone scenario*
{quote}
> Is it because write throughput becomes higher after enabling follower read?
Yes, this is the current behavior I observed.
{quote}*3. On leader throttling approaches*
{quote}
Yes, the maximum gap solution has a lot of issues (as you mentioned) and only
serves as a way to prove my hypothesis (that there is an inverse relationship
between write throughput and read throughput). I have attached the approach
suggested by [~szetszwo]: [^leader-batch-write.patch]. Although it performs
better than the leader throttling approaches, it is also inflexible, since we
don't know how frequently we need to batch writes to get the desired read and
write performance, and the interval cannot dynamically adjust to the current
workload (read-heavy or write-heavy). I tested that 10ms improved the read
throughput 1.3-1.7x, but the write throughput can be degraded to 0.7x (one
case degrades to 0.3x), although there is one case where read and write
throughput improved by 1.x to 1.4x. When I decreased it to 5ms, the read
throughput degraded again for write-heavy cases (although there is one case
where both read and write throughput improved by 1.2x-1.3x). Therefore, these
solutions require periodic and careful tuning, which is not ideal and
therefore might not be a good fit for Ratis upstream.
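For reference, the fixed-interval batching being tested could be sketched
roughly like this. This is a hypothetical standalone version, not the patch
itself (which works inside LeaderStateImpl); the names are illustrative and
the 10ms period is just the tested value:

```java
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of interval-based reply batching: hold replies and
// release them together on a fixed timer, trading reply latency for fewer
// leader-side wakeups.
class IntervalReplyFlusher {
  private final AtomicReference<List<Runnable>> held =
      new AtomicReference<>(new LinkedList<>());
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  IntervalReplyFlusher(long periodMillis) {
    scheduler.scheduleAtFixedRate(this::flush, periodMillis, periodMillis,
        TimeUnit.MILLISECONDS);
  }

  void hold(Runnable sendReply) {
    held.get().add(sendReply);
  }

  void flush() {
    // Swap in a fresh list and send the delayed replies as one batch.
    final List<Runnable> batch = held.getAndSet(new LinkedList<>());
    batch.forEach(Runnable::run);
  }

  void shutdown() {
    scheduler.shutdown();
    flush();  // release anything still held
  }
}
```

The fixed period is exactly the tuning knob criticized above: it cannot adapt
to read-heavy versus write-heavy workloads on its own.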
I had some further thoughts regarding this issue and identified problems in
my benchmark setup. I use 100 clients in both the leader-read setup
(baseline) and the follower-read setup (clients can read from both leader and
followers). The realization is that 100-client follower read will never be
better than 100-client leader read unless the leader is already saturated
(high resource usage, or RPC queue latency higher than the ReadIndex wait),
since the leader responds almost immediately but a follower needs to wait
(even 1-2ms is a lot higher than an immediate return). A better setup for
follower reads might run two concurrent setups: 1) 100 clients for leader
reads, and 2) 200 clients for follower reads, with 100 clients assigned to
each follower (assuming 2 followers and 1 leader) so that each node is
connected to 100 clients. If we can verify that the leader is not degraded
when we add the additional 200 clients exclusively to the followers, this
means that follower read works when it brings new read throughput. That said,
this also means that we cannot simply enable follower reads on an existing
workload, since the existing workload's read throughput might be degraded. I
will benchmark and validate this.
Another problem is that the benchmark only uses a single client user, which
should make the Ozone / Hadoop rate limiting ineffective. Ozone / Hadoop rate
limiting will deprioritize users (identified by username) to a lower queue if
they send a lot of requests (weighted based on things like how long the
request holds exclusive / write or shared / read locks), but if there is only
one user, this QoS will not work. Therefore, I'll try to change my benchmark
to have different users to see whether th
[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[ https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058105#comment-18058105 ]

Ivan Andika edited comment on RATIS-2403 at 2/13/26 10:01 AM:
--

[~tanxinyu] Thanks for the feedback and ideas. FYI, my current benchmark
setup:
* Set up the baseline (leader-only read and write).
* Each benchmark is set up with the following write/read workloads, ranging
from write-only to read-only:
** 100% Write
** 100% Read
** 10% Write, 90% Read
** 30% Write, 70% Read
** 90% Write, 10% Read
* There are 100 client threads, with the following configurations:
** Random: each client thread picks a random node (can be leader or
follower).
** Follower only: each client thread picks only followers.

Regarding the high pressure or saturation: currently Ozone Manager is not
able to hit the physical resource limits (CPU, I/O, network), since it is
protected by backpressure mechanisms like the RPC queue and RPC handlers, and
by synchronization mechanisms like the key lock. However, when enabling
follower reads, even with separate RPC queues and RPC handlers and less lock
contention (since OM nodes do not share locks), the read throughput suffers
quite a bit. However, when I throttled the write requests, the read requests
improved dramatically. I understand that if we push the leader to its limit,
offloading any additional load to the followers should let the overall Raft
group handle more throughput. Nonetheless, we also want to ensure that there
is no throughput degradation in the normal case.

Regarding the rate limiting, Ozone currently follows the Hadoop FairCallQueue
implementation
([https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FairCallQueue.html])
where each user's requests are weighted based on things like how long they
hold locks, etc. The user is then deprioritized to a lower queue (for
example, for every 1 request served from lower queue 2, 2 requests are served
from higher queue 1). I tried to rate limit writes, and it does yield a good
improvement in the read results (while writes now degrade), but the issue is
that my current rate limiting is not flexible and might regress if the
workload changes (e.g. more writes). Let me try to replicate your methodology
to see if it can uncover other bottlenecks.

[~tanxinyu] Btw, can I check whether linearizable follower read (with /
without lease) has been widely used in production for IoTDB? If it has, this
means that the implementation is already production-ready and the bottleneck
might be on the Ozone side. It would be great if you have some blogs or links
regarding benchmarks compared to leader-only workloads, so we can see the
expected speedup (currently my target is a 1.5x-2x read throughput increase
with no write throughput degradation).
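The 2:1 serving ratio between priority queues mentioned above can be
illustrated with a tiny weighted round-robin sketch. This is a hypothetical
toy model, far simpler than Hadoop's actual FairCallQueue and its
WeightedRoundRobinMultiplexer:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy weighted round-robin over priority queues: serve up to weights[i]
// requests from queue i before rotating, so a queue with weight 2 drains
// twice as fast as one with weight 1. Illustrative only.
class WeightedQueues {
  private final List<Deque<String>> queues = new ArrayList<>();
  private final int[] weights;
  private int currentQueue = 0;
  private int servedInTurn = 0;

  WeightedQueues(int... weights) {
    this.weights = weights;
    for (int i = 0; i < weights.length; i++) {
      queues.add(new ArrayDeque<>());
    }
  }

  void enqueue(int queueIndex, String request) {
    queues.get(queueIndex).add(request);
  }

  // Take the next request according to the weights; null if all are empty.
  String take() {
    for (int tried = 0; tried < queues.size(); tried++) {
      if (servedInTurn < weights[currentQueue]
          && !queues.get(currentQueue).isEmpty()) {
        servedInTurn++;
        return queues.get(currentQueue).poll();
      }
      currentQueue = (currentQueue + 1) % queues.size();
      servedInTurn = 0;
    }
    return null;
  }
}
```

With weights (2, 1), two requests from the high-priority queue are served for
every one from the deprioritized queue, matching the ratio described above.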
[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[ https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058029#comment-18058029 ]

Xinyu Tan edited comment on RATIS-2403 at 2/12/26 5:04 AM:
---

[~ivanandika] I believe the core issue of balancing read/write performance
needs to be examined from two distinct perspectives:
* Effectiveness of Follower Read: we can define effectiveness through the
following experimental model:
** Baseline: both read and write loads are low, and the system has no
resource bottlenecks.
** Introducing a Bottleneck: increase the query load on the Leader by N times
(number of replicas), forcing the Leader into a bottlenecked state. At this
point, write throughput will inevitably drop due to resource contention, and
query throughput will also be capped.
** Enabling Follower Read: distribute the additional query load evenly across
all replicas.
** Expected Outcome: if write performance returns to its baseline state (no
longer interfered with by queries) while the total query throughput increases
significantly without hitting physical limits across the group, it proves
that Ratis's Follower Read is truly effective in offloading the Leader and
decoupling resources.
* Resource Zero-Sum Game under High Pressure: in extreme stress-test
scenarios, the total read/write capacity of the consensus group is capped by
physical resources (CPU, I/O, Network). A clear example: even without
Follower Read, if we pin all reads and writes to the Leader until it
bottlenecks, any further increase in write load will inevitably decrease
query throughput. This phenomenon persists even with Follower Read enabled,
as faster writes force Followers to consume more resources for log
synchronization and application (Apply), which in turn encroaches on query
resources.

Based on the above analysis, I suggest the following directions:
* Introduce Admission Control (Rate Limiting): simply optimizing the
algorithm cannot change the total resource limit. To fundamentally address
the mutual interference between reads and writes, we might need a
rate-limiting mechanism for writes. This would allow us to explicitly define
a resource ceiling for writes within this trade-off, leaving guaranteed
headroom for query throughput.
* Enhance Observability and Identify Optimization Opportunities: I recommend
analyzing disk IOPS, network bandwidth, and CPU flame graphs during future
stress tests. Quantitative data, such as WAL write latency, gRPC
serialization overhead, or state machine lock contention, is critical for
pinpointing bottlenecks. For instance, in a previous optimization where I
simply batched the put operations for the WAL blocking queue, I managed to
save nearly 20% of CPU usage for Apache IoTDB. I believe that under the
current stress-test scenarios, we can uncover many more similar optimization
opportunities in Ratis by using profiling tools.
[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[ https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058029#comment-18058029 ] Xinyu Tan edited comment on RATIS-2403 at 2/12/26 5:04 AM: --- [~ivanandika] I believe the core issue of balancing read/write performance needs to be examined from two distinct perspectives: * Effectiveness of Follower read: We can define effectiveness through the following experimental model: ** Baseline: Both read and write loads are low, and the system has no resource bottlenecks. ** Introducing a Bottleneck: Increase the query load on the Leader by N times (number of replicas), forcing the Leader into a bottlenecked state. At this point, write throughput will inevitably drop due to resource contention, and query throughput will also be capped. ** Enabling Follower Read: Distribute these additional query loads evenly across all replicas. ** Expected Outcome: If write performance returns to its baseline state (no longer interfered with by queries) while the total query throughput increases significantly without hitting physical limits across the group, it proves that Ratis’s Follower Read is truly effective in offloading the Leader and decoupling resources. * Resource Zero-Sum Game under High Pressure: In extreme stress-test scenarios, the total read/write capacity of the consensus group is capped by physical resources (CPU, I/O, Network). A clear example is: even without Follower Read, if we pin all reads and writes to the Leader until it bottlenecks, any further increase in write load will inevitably decrease query throughput. This phenomenon persists even with Follower Read enabled, as faster writes force Followers to consume more resources for log synchronization and application (Apply), which in turn encroaches on query resources. 
Based on the above analysis, I suggest the following directions:
* Introduce Admission Control (Rate Limiting): Simply optimizing the algorithm cannot change the total resource limit. To fundamentally address the mutual interference between reads and writes, we might need a rate-limiting mechanism for writes. This would allow us to explicitly define a resource ceiling for writes within this trade-off, leaving guaranteed headroom for query throughput.
* Enhance Observability and Identify Optimization Opportunities: I recommend analyzing disk IOPS, network bandwidth, and CPU flame graphs during future stress tests. Quantitative data, such as WAL write latency, gRPC serialization overhead, or state machine lock contention, is critical for pinpointing bottlenecks. For instance, in a previous optimization where I simply batched the put operations for the WAL blocking queue, I managed to save nearly 20% of CPU usage for Apache IoTDB. I believe that under the current stress-test scenarios, we can uncover many more similar optimization opportunities in Ratis by using profiling tools.
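The admission-control direction above could be prototyped with a simple token bucket in front of the write path. A minimal sketch, with hypothetical names throughout (this class and its integration point are not part of the Ratis API): the idea is to cap write throughput explicitly so reads keep guaranteed headroom.

```java
/**
 * Illustrative token-bucket admission control for writes. The class and its
 * parameters are assumptions, not Ratis APIs: the point is only to show how a
 * configurable write ceiling would leave headroom for query throughput.
 */
public final class WriteAdmissionControl {
  private final long burstCapacity;    // max permits accumulated while idle
  private final double refillPerNano;  // permits added per nanosecond
  private double tokens;
  private long lastRefillNanos;

  public WriteAdmissionControl(long permitsPerSecond, long burstCapacity) {
    this.burstCapacity = burstCapacity;
    this.refillPerNano = permitsPerSecond / 1e9;
    this.tokens = burstCapacity;
    this.lastRefillNanos = System.nanoTime();
  }

  /** Returns true if one write may proceed now; otherwise reject or queue it. */
  public synchronized boolean tryAcquire() {
    long now = System.nanoTime();
    tokens = Math.min(burstCapacity, tokens + (now - lastRefillNanos) * refillPerNano);
    lastRefillNanos = now;
    if (tokens >= 1.0) {
      tokens -= 1.0;
      return true;
    }
    return false;
  }
}
```

A real deployment would additionally need to pick the permit rate per workload, which is exactly the tuning difficulty the discussion raises.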
[jira] [Comment Edited] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[ https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057756#comment-18057756 ] Ivan Andika edited comment on RATIS-2403 at 2/11/26 8:23 AM:
---
[~szetszwo] Thanks for the batching idea. That sounds like a good approach. Let me think about how to implement this.
FYI, regarding the previous leader backpressure mechanism, I asked an LLM to write a prototype that blocks log appends if the follower commit index falls behind the leader, and benchmarked it ([^leader-backpressure.patch]). I benchmarked it with raft.server.write.follower.gap.ratio.max=1. The result is that reads stay within 1.5x, but writes degrade to 0.2x-0.5x. This validates that slowing writes speeds up reads. However, this solution is not flexible enough and should not be deployed.
The LLM also tried implementing the SingleFlight pattern, making multiple follower read requests share one ReadIndex RPC, so that later follower read requests join the ongoing ReadIndex RPC. However, this seems to be invalid: it violates linearizability, since the ReadIndex might already be stale for the later requests.
> Improve linearizable follower read throughput instead of writes
> ---
> Key: RATIS-2403
> URL: https://issues.apache.org/jira/browse/RATIS-2403
> Project: Ratis
> Issue Type: Improvement
> Reporter: Ivan Andika
> Priority: Major
> Attachments: leader-backpressure.patch
>
> While benchmarking linearizable follower read, the observation is that the more requests go to the followers instead of the leader, the better write throughput becomes; we saw around a 2-3x write throughput increase compared to leader-only write and read (most likely due to less leader resource contention). However, the read throughput becomes worse than leader-only write and read (some results can be below 0.2x). Even with optimizations such as RATIS-2392, RATIS-2382, [https://github.com/apache/ratis/pull/1334], and RATIS-2379, the read throughput remains worse than leader-only write (they even improve the write performance instead of the read performance).
> I suspect that because write throughput increases, the read index increases at a faster rate, which causes follower linearizable reads to wait longer.
> The target is to improve read throughput to 1.5x-2x of leader-only writes and reads. Currently pure reads (no writes) improve read throughput up to 1.7x, but total follower read throughput is way below this target.
> Currently my ideas are:
> * Sacrificing writes for reads: Can we limit the write QPS so that read QPS can increase?
> ** From the benchmark, the read throughput only improves when write throughput is lower.
> ** We can try to use a backpressure mechanism so that writes do not advance so quickly that read throughput suffers.
> *** Follower gap mechanisms (RATIS-1411), but this might cause the leader to stall if a follower is down for a while (e.g. restarted), which violates the majority availability guarantee. It's also hard to know which value is optimal for different workloads.
> Raising this ticket for ideas. [~szetszwo] [~tanxinyu]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
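For the batching direction, a variant of SingleFlight that preserves linearizability is to let requests share a ReadIndex RPC only if they arrived before that RPC was issued; requests arriving while one is in flight wait for the next round instead of joining it. A hedged sketch with hypothetical names, using a LongSupplier to stand in for the real follower-to-leader ReadIndex RPC:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.LongSupplier;

/**
 * Sketch of safe ReadIndex batching on a follower. Unlike plain SingleFlight,
 * a request never joins an RPC that is already in flight: every waiter in a
 * batch registered before its RPC was issued, so the returned index is at
 * least as fresh as each waiter's arrival time.
 */
public final class ReadIndexBatcher {
  private final LongSupplier sendReadIndexRpc; // stands in for the real RPC to the leader
  private final List<CompletableFuture<Long>> pending = new ArrayList<>();
  private boolean inFlight = false;

  public ReadIndexBatcher(LongSupplier sendReadIndexRpc) {
    this.sendReadIndexRpc = sendReadIndexRpc;
  }

  public CompletableFuture<Long> readIndex() {
    CompletableFuture<Long> f = new CompletableFuture<>();
    synchronized (this) {
      pending.add(f);
      if (inFlight) {
        return f; // an RPC is already out; this request waits for the next round
      }
      inFlight = true;
    }
    drainLoop();
    return f;
  }

  private void drainLoop() {
    while (true) {
      final List<CompletableFuture<Long>> batch;
      synchronized (this) {
        if (pending.isEmpty()) {
          inFlight = false;
          return;
        }
        batch = new ArrayList<>(pending); // snapshot before issuing the RPC
        pending.clear();
      }
      long index = sendReadIndexRpc.getAsLong(); // one RPC serves the whole batch
      batch.forEach(future -> future.complete(index));
    }
  }
}
```

This collapses many concurrent follower reads into one ReadIndex RPC per round trip without reusing a potentially stale index; for simplicity the sketch runs the RPC on the caller's thread, which a real implementation would move to an executor.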
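The follower-gap backpressure idea benchmarked above reduces to a threshold check on the leader's append path. A minimal sketch under assumptions: the property name raft.server.write.follower.gap.ratio.max is from the patch, but the class, the choice of gap unit, and the use of the majority follower's commit index are all illustrative, not how the patch necessarily works.

```java
/**
 * Sketch of a follower-gap backpressure check in the spirit of
 * leader-backpressure.patch. Names are assumptions, not Ratis APIs: the
 * leader admits a new append only while the majority follower commit index
 * is within maxGapRatio * gapUnit of its own commit index (a ratio of 1
 * mirroring raft.server.write.follower.gap.ratio.max=1).
 */
public final class FollowerGapPolicy {
  private final double maxGapRatio;
  private final long gapUnit; // baseline gap, e.g. entries per log segment (assumed)

  public FollowerGapPolicy(double maxGapRatio, long gapUnit) {
    this.maxGapRatio = maxGapRatio;
    this.gapUnit = gapUnit;
  }

  /** True if a new append should be admitted; false applies backpressure. */
  public boolean admitAppend(long leaderCommitIndex, long majorityFollowerCommitIndex) {
    long gap = leaderCommitIndex - majorityFollowerCommitIndex;
    return gap <= maxGapRatio * gapUnit;
  }
}
```

Tracking the majority follower rather than the slowest one mitigates the stall concern the ticket raises about RATIS-1411 (a single downed follower should not freeze writes), though choosing maxGapRatio per workload remains the hard part.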
