About our usage of LWT in Cassandra related code

[email protected] (OpenPaaS) Thu, 03 Dec 2020 23:22:17 -0800

Hi,

I'm currently trying to increase overall efficiency of the Distributed
James server.

As such, I'm pocking around for improvement areas and found a huge topic
around LWT.

My conclusions so far are that we should keep LWT and SERIAL consistency
level out of the most common use cases.

I know that this is a massive change in regard of the way the project
had been working with Cassandra in the past few years. I would
definitely, in the middle term, would like to reach LWT free reads on
the Cassandra Mailbox to scale the deployments I am responsible of as
part of my Linagora job (my long term goal being to decrease the total
cost of ownership of a "Distributed James" based solution). While I am
not opposed to diverge from the Apache James project on this point, if
needed, I do believe an efficient distributed server (with the
consequences it implies in term of eventual consistency) might be a
strong asset for the Apache project as well, and would prefer to see
this work lending on the James project.

I've been ambitious on the ADR writing, especially in the complementary
work section. Let's see which consensual ground we find on that! (the ML
version here below serving as a public, immutable reference of my thinking!)

Cheers,

Benoit

-------------------------------------------------------------------

## Context

As any kind of server James needs to provide some level of consistencies.

Strong consistency can be achieved with Cassandra by relying on
LightWeight transactions. This enables
optimistic transactions on a single partition key.

Under the hood, Cassandra relies on the PAXOS algorithm to achieve
consensus across replica allowing us
to achieve linearizable consistency at the entry level. To do so,
Cassandra tracks consensus in a system.paxos
table. This `system.paxos` table needs to be checked upon reads as well
in order to ensure the latest state of the ongoing
consensus is known. This can be achieved by using the SERIAL consistency
level.

Experiments on a distributed James cluster (4 James nodes, having 4 CPU
and 8 GB of RAM each, and a 3 node Cassandra
cluster of 32 GB of RAM, 8 CPUs, and SSD disks) demonstrated that the
system.paxos table was by far the most read
and compacted table (ratio 5).

The table triggering the most reads to the `system.paxos` table was the
`acl` table. Deactivating LWT on this table alone
(lightweight transactions & SERIAL consistency level) enabled an instant
80% throughput, latencies reductions
as well as softer degradations when load breaking point is exceeded.

## Decision

Rely on `event sourcing` to maintain a projection of ACLs that do not
rely on LWT or SERIAL consistency level.

Event sourcing is thus responsible of handling concurrency and race
conditions as well as governing denormalization
for ACLs. It can be used as a source of truth to re-build ACL projections.

Note that the ACL projection tables can end up being out of
synchronization from the aggregate but we still have a
non-questionable source of truth handled via event sourcing.

## Consequences

We expect a better load handling, better response time, and cheaper
operation costs for Distributed James while not
compromising the data safety of ACL operations.

ACL updates being a rare operation, we do not expect significant
degradation of write performance by relying on
`eventSourcing`.

We need to implement a corrective task to fix the ACL denormalization
projections. Applicative read repairs could be
implemented as well, offering both diagnostic and on-the-fly corrections
without admin actions (a low probability should
however be used as loading an event sourcing aggregate is not a cheap
thing).

## Complementary work

There are several other places where we rely on Lightweight transaction
in the Cassandra code base and
that we might want to challenge:

- `users` we rely on LWT for throwing "AlreadyExist" exceptions. LWT
are likely unnecessary as the webadmin
presentation layer is offering an idempotent API (and silents the
AlreadyExist exceptions). Only the CLI
(soon to be deprecated for Guice products) makes this distinction.
Discussions have started on the topic and a proof of
concept is available.
- `domains` we rely on LWT for throwing "AlreadyExist" exceptions. LWT
are likely unnecessary as the webadmin
presentation layer is offering an idempotent API (and silents the
AlreadyExist exceptions). Only the CLI
(soon to be deprecated for Guice products) makes this distinction.
Discussions have started on the topic and a proof of
concept is available.
- `mailboxes` relies on LWT to enforce name unicity. We hit the same
pitfalls than for ACLs as this is a very often
read table (however mailboxes of a given user being grouped together,
primary key read are more limited hence this is
less critical). Similar results could be expected. Discussions on this
topic have not been started yet. Further
impact studies on performance needs to be conducted.
- `messages` as flags update is so far transactional. However, by
better relying on the table structure used to store
flags we could be relying on Cassandra to solve data race issues for us.
Note also that IMAP CONDSTORE extension is not
implemented, and might be a non-viable option performance-wise. We might
choose to favor performance other
transactionality on this topic. Discussions on this topic have not
started yet.

LWT are required for `eventSourcing`. As event sourcing usage is limited
to low-usage use cases, the performance
degradations are not an issue.

LWT usage is required to generate `UIDs`. As append message operations
tend to be limited compared to
message update operations, this is likely less critical. UID generation
could be handled via alternative systems,
past implementations have been conducted on ZooKeeper.

If not implementing IMAP CONDSTORE, generation of IMAP `MODSEQ` likely
no longer makes sense. As such the fate of
`MODSEQ` is linked to decisions on the `message` topic.

Similarly, LWT are used to try to keep the count of emails in
MailRepository synchronize. Such a usage is non-performance
critical for a MDA (Mail Delivery Agent) use case but might have a
bigger impact for MTA (Mail Transfer Agent). No
discussion not work have been started on the topic.

Other usage of LWT includes Sieve script management, initialization of
the RabbitMQMailQueue browse start and other
low-impact use cases.

## References

* [Original pull request exploring the
topic](https://github.com/apache/james-project/pull/255):
`JAMES-3435 Cassandra: No longer rely on LWT for domain and users`
* [JIRA ticket](https://issues.apache.org/jira/browse/JAMES-3435)
* [Pull request abandoning LWT on reads for mailbox
ACL](https://github.com/linagora/james-project/pull/4103)
* [ADR-42 Applicative read
repairs](https://github.com/apache/james-project/blob/master/src/adr/0042-applicative-read-repairs.md)
* [ADR-21 ACL
inconsistencies](https://github.com/apache/james-project/blob/master/src/adr/0021-cassandra-acl-inconsistency.md)
* [Buggy IMAP CONDSTORE](https://issues.apache.org/jira/browse/JAMES-2055)
* [Link to the Mailing list thread discussing this ADR](LINK TO BE INCLUDED)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

About our usage of LWT in Cassandra related code

Reply via email to