[jira] [Commented] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18064984#comment-18064984
]
Ivan Andika commented on RATIS-2403:
Found a paper that might be useful: [https://www.vldb.org/pvldb/vol18/p2831-giortamis.pdf]
It introduces Almost-Local Reads (ALRs), which use a form of batching, but it
appears to batch reads on the follower rather than batching writes on the
leader (which is what we are pursuing). The paper claims the approach is
linearizable with a significant performance improvement (2.5x for Raft).
> Improve linearizable follower read throughput instead of writes
> ---
>
> Key: RATIS-2403
> URL: https://issues.apache.org/jira/browse/RATIS-2403
> Project: Ratis
> Issue Type: Improvement
> Components: Linearizable Read
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Attachments: 1362_review.patch, leader-backpressure.patch,
> leader-batch-write.patch
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> While benchmarking linearizable follower read, we observed that the more
> requests go to the followers instead of the leader, the better write
> throughput becomes: around a 2-3x write throughput increase compared to
> leader-only write and read (most likely due to less leader resource
> contention). However, read throughput becomes worse than leader-only write
> and read (in some cases below 0.2x). Even with optimizations such as
> RATIS-2392, RATIS-2382, [https://github.com/apache/ratis/pull/1334], and
> RATIS-2379, read throughput remains worse than leader-only write (these
> optimizations mainly improve write performance rather than read performance).
> I suspect that because write throughput increases, the read index advances at
> a faster rate, which causes follower linearizable reads to wait longer.
> The target is to improve read throughput to 1.5x-2x of leader-only write and
> read. Currently a pure-read workload (no writes) improves read throughput by
> up to 1.7x, but total follower read throughput is well below this target.
> Currently my ideas are:
> * Sacrificing writes for reads: can we limit the write QPS so that read QPS
> can increase?
> ** From the benchmark, read throughput only improves when write throughput is
> lower.
> ** We can try to use a backpressure mechanism so that writes do not advance
> so quickly that read throughput suffers.
> *** Follower gap mechanisms (RATIS-1411), but this might cause the leader to
> stall if a follower is down for a while (e.g. restarted), which violates the
> majority availability guarantee. It is also hard to know which value is
> optimal for different workloads.
> Raising this ticket for ideas. [~szetszwo] [~tanxinyu]
[jira] [Commented] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060822#comment-18060822
]
Tsz-wo Sze commented on RATIS-2403:
---
{quote}... If you think this can be safely pushed upstream, I can raise a PR
for this.
{quote}
Similar to RATIS-2379, we may make it configurable. Since RATIS-2379 is not yet
released, we should indeed do it now and replace the conf
- raft.server.read.read-index.applied-index.enabled
with something like
- raft.server.read.read-index.type with enum COMMIT_INDEX, APPLIED_INDEX and
REPLIED_INDEX.
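For concreteness, a minimal sketch of what the proposed property could look
like; the key name follows the comment above, but the enum layout and default
are illustrative assumptions, not committed Ratis code:
{code}
// Illustrative sketch only -- not committed Ratis code.
public enum ReadIndexType {
  COMMIT_INDEX,   // wait on the leader commit index
  APPLIED_INDEX,  // wait on the state machine applied index (RATIS-2379)
  REPLIED_INDEX   // wait on the batched replied index discussed in this ticket
}

// Hypothetical config key, replacing raft.server.read.read-index.applied-index.enabled:
static final String READ_INDEX_TYPE_KEY = "raft.server.read.read-index.type";
static final ReadIndexType READ_INDEX_TYPE_DEFAULT = ReadIndexType.COMMIT_INDEX;
{code}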
[jira] [Commented] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060647#comment-18060647
]
Ivan Andika commented on RATIS-2403:
> Since it just delays the replies, the write throughput should not be
> degraded. I think the problem is at the client side (i.e. the benchmark) --
> the client may wait for the previous reply before sending another request. If
> it is the case, using more clients should be able to keep the write
> throughput.
Yeah, that makes sense; the write throughput degradation can probably be
reduced that way. However, in the Ozone case the handler thread handling writes
is blocked waiting for RaftServerImpl#submitClientRequestAsync to complete, so
we might also need to scale the handlers.
> BTW, the code should replace the reference but not copying the list. Also,
> use LinkedList for avoiding ArrayList resizing.
Thanks for the suggestion. This was generated by an LLM since it is currently a
PoC; I will try to refine it further (e.g. by using AwaitForSignal instead of
sleeping). If you think this can be safely pushed upstream, I can raise a PR
for this.
[jira] [Commented] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060444#comment-18060444
]
Tsz-wo Sze commented on RATIS-2403:
---
[~ivanandika], thanks for trying out the leader-batch-write!
bq. ... but the write throughput can be degraded to 0.7x (one case degrades to
0.3x), ...
Since it just delays the replies, the write throughput should not be degraded.
I think the problem is at the client side (i.e. the benchmark) -- the client
may wait for the previous reply before sending another request.
It makes sense if the latency is degraded.
BTW, the code should replace the reference rather than copy the list. Also, use
a LinkedList to avoid ArrayList resizing.
{code}
//LeaderStateImpl
private final AtomicReference<List<RaftClientReply>> heldReplies = ...
...
private void flushReplies() {
  if (heldReplies.get().isEmpty()) {
    return;
  }
  final List<RaftClientReply> toFlush = heldReplies.getAndSet(new LinkedList<>());
  ...
}
{code}
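One possible shape for the periodic flush trigger is a sketch along the
following lines, assuming a ScheduledExecutorService (Ivan's comment above
mentions AwaitForSignal as an alternative to sleeping). RaftClientReply is the
Ratis reply type; the class, flushIntervalMs, and sendReply are illustrative
assumptions, not existing Ratis APIs:
{code}
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical sketch: hold replies and flush them in batches on a timer.
class ReplyBatcher<REPLY> {
  private final AtomicReference<List<REPLY>> heldReplies =
      new AtomicReference<>(new LinkedList<>());
  private final ScheduledExecutorService flusher =
      Executors.newSingleThreadScheduledExecutor();
  private final Consumer<REPLY> sendReply;  // illustrative per-reply completion

  ReplyBatcher(Consumer<REPLY> sendReply, long flushIntervalMs) {
    this.sendReply = sendReply;
    flusher.scheduleWithFixedDelay(
        this::flushReplies, flushIntervalMs, flushIntervalMs, TimeUnit.MILLISECONDS);
  }

  void hold(REPLY reply) {
    // Synchronize with the getAndSet in flushReplies so a reply is never added
    // to a list that has already been swapped out and flushed.
    synchronized (heldReplies) {
      heldReplies.get().add(reply);
    }
  }

  private void flushReplies() {
    final List<REPLY> toFlush;
    synchronized (heldReplies) {
      if (heldReplies.get().isEmpty()) {
        return;
      }
      // Swap in a fresh list atomically instead of copying the old one.
      toFlush = heldReplies.getAndSet(new LinkedList<>());
    }
    toFlush.forEach(sendReply);
  }
}
{code}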
[jira] [Commented] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060285#comment-18060285
]
Ivan Andika commented on RATIS-2403:
Thank you [~tanxinyu], and apologies for the late reply. Happy CNY, I was also
celebrating CNY last week.
{quote}*1. Regarding follower read usage in IoTDB*
{quote}
Thanks for the insight regarding the usage of linearizable read in IoTDB. The
control-plane and data-plane setup is similar to Ozone's OM and Datanodes: the
control plane contains a single Raft group, and the datanodes use a multi-Raft
setup to distribute load. Currently, we are encountering a bottleneck on the
control nodes (high RPC queue times on the OM leader, mostly due to lock
contention, e.g. multi-key operations needing the same bucket lock, which
blocks caller threads). We intended to enable follower read to improve the read
throughput of OM, but it had the unintended effect of reducing read throughput
while increasing write throughput. For the data plane, similar to IoTDB we also
use multi-Raft (a single Raft group is called a "Ratis Pipeline") to balance
load at the data-plane level, but unlike IoTDB, reads on the data plane do not
go through Ratis. Instead we use an internal sequence ID / version (the Block
Commit Sequence ID, BCSID) which corresponds to the Raft log index: when a user
wants to read a key, the client checks the BCSID stored in the meta nodes
against the one in the datanode replica; if they mismatch, the datanode replica
is stale and the client picks another node. The Raft groups in Ozone pipelines
are also relatively short-lived; they can be destroyed and recreated in
response to failures, migration, etc.
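As an illustration, the staleness check described above amounts to roughly the
following; all names here are hypothetical, not actual Ozone APIs:
{code}
// Rough sketch of the BCSID staleness check described above.
// bcsidFromMeta / bcsidFromReplica are illustrative placeholders.
long bcsidFromMeta(String blockId) { /* BCSID recorded in the meta nodes */ return 0; }
long bcsidFromReplica(String datanode, String blockId) { /* ask the replica */ return 0; }

boolean isReplicaFresh(String datanode, String blockId) {
  // A replica whose BCSID is behind the one recorded in the meta nodes has not
  // applied the corresponding Raft log entries yet; the client picks another node.
  return bcsidFromReplica(datanode, blockId) >= bcsidFromMeta(blockId);
}
{code}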
{quote}*2. On the Ozone scenario*
{quote}
> Is it because write throughput becomes higher after enabling follower read?
Yes, this is the current behavior I observed.
{quote}*3. On leader throttling approaches*
{quote}
Yes, the maximum-gap solution has a lot of issues (as you mentioned) and only
serves as a way to prove my hypothesis that there is an inverse relationship
between write throughput and read throughput. I have attached the approach
suggested by [~szetszwo]: [^leader-batch-write.patch]. Although it performs
better than the leader throttling approaches, it is also inflexible, since we
don't know how frequently we need to batch writes to get the desired read and
write performance, and the interval cannot dynamically adjust to the current
workload (read-heavy or write-heavy). I found that a 10ms interval improved
read throughput by 1.3-1.7x, but write throughput could degrade to 0.7x (one
case degraded to 0.3x), although there was one case where both read and write
throughput improved by 1.x-1.4x. When I decreased the interval to 5ms, read
throughput degraded again for write-heavy cases (although there was one case
where both read and write throughput improved by 1.2x-1.3x). Therefore, these
solutions require periodic and careful tuning, which is not ideal and therefore
might not be a good fit for Ratis upstream.
I had some further thoughts on this issue and identified problems in my
benchmark setup. I use 100 clients in both the leader-read setup (baseline) and
the follower-read setup (clients can read from both leader and followers). The
realization is that 100-client follower read will never beat 100-client leader
read unless the leader is already saturated (high resource usage, or RPC queue
latency higher than the readIndex wait), since the leader responds almost
immediately while a follower has to wait (even 1-2ms is much higher than an
immediate return). A better setup for follower reads might run two
configurations concurrently: 1) 100 clients doing leader reads, and 2) 200
clients doing follower reads, with 100 clients assigned to each follower
(assuming 2 followers and 1 leader), so that every node is connected to 100
clients. If we can verify that the leader is not degraded when we add the 200
extra clients exclusively on the followers, this would show that follower read
works for absorbing new read throughput. That said, it also means we cannot
simply enable follower reads on an existing workload, since the existing
workload's read throughput might be degraded. I will benchmark and validate
this.
Another problem is that the benchmark only uses a single client user, which
should make the Ozone / Hadoop rate limiting ineffective. Ozone / Hadoop rate
limiting deprioritizes users (identified by username) to lower-priority queues
if they send a lot of requests (weighted by things like how long the requests
hold exclusive / write or shared / read locks), but if there is only one user,
this QoS will not kick in. Therefore, I'll try to change my benchmark to use
different users to see whether the rate limiter is able to balance the read and
write requests.
[jira] [Commented] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058852#comment-18058852
]
Xinyu Tan commented on RATIS-2403:
--
[~ivanandika]
Sorry for the delayed reply — we’ve just been celebrating the Chinese New Year
here in China. Thank you again for your patience and for the thoughtful
discussion. I’d like to share a few points:
*1. Regarding follower read usage in IoTDB*
In IoTDB, we do not use follower read. Instead, we rely on linearizable read
and lease read.
The main reason is that our cluster size is typically on the order of dozens of
nodes. For control-plane nodes, although we usually have only a single Raft
group, we have not yet encountered bottlenecks at the Raft layer. For
data-plane nodes, we use a multi-Raft architecture to distribute load. As long
as there are no significant hotspots, read and write traffic can be evenly
distributed across leaders of different Raft groups, and therefore across all
nodes.
In some scenarios, we observed that enabling follower read requires maintaining
certain state machine cache states consistently across multiple replicas, which
introduces additional overhead. Considering all of this, we chose to use
multi-Raft to balance load at the data-plane level, and within each Raft group,
we keep both reads and writes on the leader to maximize resource efficiency.
*2. On the Ozone scenario*
In your Ozone scenario, if the system has not yet reached any physical
bottleneck (CPU, disk, network, etc.), why would enabling follower read
actually reduce query throughput?
Is it because write throughput becomes higher after enabling follower read? If
so, that would suggest we need to introduce some form of write throttling to
rebalance resource usage. If not, then it may indicate other bottlenecks, and
we probably need to profile the system more carefully to identify the root
cause.
*3. On leader throttling approaches*
I reviewed your {{leader-backpressure.patch}}. Personally, I don't think
defining a maximum gap between the leader and the slowest follower is a
reasonable solution.
In a production environment, even if we configure a practical threshold, once a
follower has been down for a sufficiently long time, the gap will eventually
exceed the limit, which could block the entire Raft group from making further
progress on writes. This violates the liveness guarantees of Raft.
Moreover, in a three-replica Raft group, even if one follower lags
significantly behind, it can theoretically catch up later via snapshot
installation. Therefore, tying write availability to the slowest follower’s gap
does not seem like a robust throttling strategy.
Instead, I would prefer limiting the rate at which the leader advances the
{{commitIndex}}. For example, we could detect the advancement rate of
{{commitIndex}} inside {{tryAcquirePendingRequest}}, and if it is progressing
too quickly, temporarily block or slow down new writes.
Although this approach may require case-by-case tuning (e.g., how many write
ops per second to allow), it provides a clear upper bound on the resources
consumed by writes, thereby reserving sufficient headroom for read queries.
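A minimal sketch of such a commitIndex-rate throttle, as a hypothetical helper;
none of these names are existing Ratis APIs, and a real implementation would
hook into tryAcquirePendingRequest:
{code}
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: admit at most maxCommitsPerSecond writes per 1s window,
// briefly blocking callers once the window's budget is used up.
class CommitRateThrottle {
  private final long maxCommitsPerSecond;  // illustrative config knob
  private long windowStartNanos = System.nanoTime();
  private long commitsInWindow = 0;

  CommitRateThrottle(long maxCommitsPerSecond) {
    this.maxCommitsPerSecond = maxCommitsPerSecond;
  }

  // Called once per write admission (e.g. from tryAcquirePendingRequest).
  void acquire() throws InterruptedException {
    long sleepNanos = 0;
    synchronized (this) {
      final long now = System.nanoTime();
      if (now - windowStartNanos >= TimeUnit.SECONDS.toNanos(1)) {
        windowStartNanos = now;  // start a new 1s window
        commitsInWindow = 0;
      }
      if (++commitsInWindow > maxCommitsPerSecond) {
        sleepNanos = TimeUnit.SECONDS.toNanos(1) - (now - windowStartNanos);
      }
    }
    if (sleepNanos > 0) {
      TimeUnit.NANOSECONDS.sleep(sleepNanos);  // wait out the rest of the window
    }
  }
}
{code}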
[jira] [Commented] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058105#comment-18058105
]
Ivan Andika commented on RATIS-2403:
[~tanxinyu] Thanks for the feedback and ideas. FYI, my current benchmark setup:
* Set up the baseline (leader-only read and write).
* Each benchmark runs the following write/read workloads, ranging from
write-only to read-only:
** 100% Write
** 100% Read
** 10% Write, 90% Read
** 30% Write, 70% Read
** 90% Write, 10% Read
* There are 100 client threads, with the following configurations:
** Random: each client thread picks a random node (can be leader or follower)
** Follower only: each client thread picks only followers
Regarding high pressure or saturation: currently Ozone Manager is not able to
hit the physical resource limits (CPU, I/O, network), since it is protected by
backpressure mechanisms such as the RPC queue and RPC handlers, and
synchronization mechanisms such as key locks. However, when enabling follower
reads, even though there are separate RPC queues and RPC handlers and less lock
contention (since OM nodes do not share locks), read throughput suffers quite a
bit. When I throttled the write requests, read throughput improved
dramatically.
Regarding rate limiting, Ozone currently follows the Hadoop FairCallQueue
implementation
([https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FairCallQueue.html]),
where each user's requests are weighted by things like how long they hold
locks. Heavy users are then deprioritized to a lower queue (so that, for
example, for every 1 request served from lower-priority queue 2, 2 requests are
served from higher-priority queue 1). I tried rate-limiting writes and it does
yield a good improvement in read results (while writes now degrade), but my
current rate limiting is not flexible and might regress if the workload changes
(e.g. more writes). Let me try to replicate your methodology to see if it can
uncover other bottlenecks.
[~tanxinyu] BTW, can I check whether linearizable follower read (with / without
lease) has been widely used in production for IoTDB? If it has, that means the
implementation is already production-ready and the bottleneck might be on the
Ozone side. It would be great if you have blogs or links with benchmarks
against leader-only workloads, so we can see the expected speedup (currently my
target is a 1.5x-2x read throughput increase with no write throughput
degradation).
[jira] [Commented] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18058029#comment-18058029
]
Xinyu Tan commented on RATIS-2403:
--
[~ivanandika] I believe the core issue of balancing read/write performance
needs to be examined from two distinct perspectives:
# Effectiveness of Follower Read: under a fixed load where the Leader has not
yet reached its physical bottleneck, we need to verify whether enabling
Follower Read effectively balances the query load across all replicas. If we
observe a query throughput increase nearly proportional to the number of
replicas while write throughput remains stable, it proves that the current
Follower Read mechanism in Ratis is effective in offloading the Leader.
# Resource zero-sum game under high pressure: in extreme stress-test scenarios,
the total read/write capacity of the consensus group is capped by physical
resources (CPU, I/O, network). A clear example: even without Follower Read, if
we pin all reads and writes to the Leader until it bottlenecks, any further
increase in write load will inevitably decrease query throughput. This
phenomenon persists even with Follower Read enabled, as faster writes force
Followers to consume more resources for log synchronization and application
(apply), which in turn encroaches on query resources.
Based on the above analysis, I suggest the following directions:
* Introduce admission control (rate limiting): simply optimizing the algorithm
cannot change the total resource limit. To fundamentally address the mutual
interference between reads and writes, we might need a rate-limiting mechanism
for writes. This would allow us to explicitly define a resource ceiling for
writes within this trade-off, leaving guaranteed headroom for query throughput.
* Enhance observability and identify optimization opportunities: I recommend
analyzing disk IOPS, network bandwidth, and CPU flame graphs during future
stress tests. Quantitative data -- such as WAL write latency, gRPC
serialization overhead, or state machine lock contention -- is critical for
pinpointing bottlenecks. For instance, in a previous optimization where I
simply batched the put operations for the WAL blocking queue (see the sketch
after this list), I managed to save nearly 20% of CPU usage for Apache IoTDB. I
believe that under the current stress-test scenarios, we can uncover many more
similar optimization opportunities in Ratis by using profiling tools.
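A minimal sketch of the kind of WAL batching described above, assuming a
producer/consumer blocking queue; the class and syncToDisk are illustrative,
not IoTDB or Ratis APIs:
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: the consumer drains the queue in batches instead of
// taking entries one by one, amortizing per-entry queue and sync overhead.
class WalBatchConsumer {
  private final BlockingQueue<byte[]> walQueue;

  WalBatchConsumer(BlockingQueue<byte[]> walQueue) {
    this.walQueue = walQueue;
  }

  void run() throws InterruptedException {
    final List<byte[]> batch = new ArrayList<>();
    while (!Thread.currentThread().isInterrupted()) {
      batch.add(walQueue.take());      // block for the first entry
      walQueue.drainTo(batch, 1023);   // then grab whatever else is queued
      syncToDisk(batch);               // one write + fsync for the whole batch
      batch.clear();
    }
  }

  private void syncToDisk(List<byte[]> batch) {
    // illustrative placeholder: append all entries, then fsync once
  }
}
{code}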
[jira] [Commented] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057756#comment-18057756
]
Ivan Andika commented on RATIS-2403:
[~szetszwo] Thanks for the batching idea. That sounds like a good idea; let me
think about how to approach this.
FYI, regarding the previous leader backpressure mechanism, I asked an LLM to
write a prototype that blocks log appends and benchmarked it
([^leader-backpressure.patch]). I benchmarked it with
raft.server.write.follower.gap.ratio.max=1. The result is that reads stay
within 1.5x, but writes degrade to 0.2x-0.5x. This validates that slowing
writes would speed up reads. However, this solution is not flexible enough and
should not be deployed.
[jira] [Commented] (RATIS-2403) Improve linearizable follower read throughput instead of writes
[
https://issues.apache.org/jira/browse/RATIS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057643#comment-18057643
]
Tsz-wo Sze commented on RATIS-2403:
---
[~ivanandika], thanks for benchmarking and providing the interesting ideas for
improvement.
One thing we could try is to batch the writes -- instead of processing the
writes one by one, process them in batches. The goal is to make the
appliedIndex (i.e. the readIndex) change less frequently. For example:
- Processing one by one: appliedIndex increments every 1ms
- Processing in batches (time batching): appliedIndex increments every 10ms
Of course, we still want to keep the state machine busy. We will keep applying
the transactions one by one but reply to the clients in batches. So, we may
need one more index, repliedIndex, and use it as the readIndex (instead of
using appliedIndex).
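A minimal sketch of the repliedIndex idea under the assumptions above; all
names here are illustrative, not existing Ratis code:
{code}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: keep applying transactions one by one, but only publish
// the index used by readIndex (and release the held client replies) once per
// batch interval.
class RepliedIndex {
  private final AtomicLong appliedIndex = new AtomicLong();  // advances per transaction
  private final AtomicLong repliedIndex = new AtomicLong();  // advances per batch

  // Called by the state machine after applying each log entry; never waits.
  void onApplied(long logIndex) {
    appliedIndex.set(logIndex);
  }

  // Called every batch interval (e.g. 10ms): publish the applied index and
  // complete the client replies held for indices <= repliedIndex.
  void flushBatch() {
    repliedIndex.set(appliedIndex.get());
    // ... release the held replies here
  }

  // Linearizable reads wait on this instead of appliedIndex, so the read index
  // changes only once per batch interval.
  long getReadIndex() {
    return repliedIndex.get();
  }
}
{code}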
