This ADR can also be reviewed on GitHub: https://github.com/apache/james-project/pull/271
On 04/12/2020 at 14:22, btell...@linagora.com (OpenPaaS) wrote:

Hi,

I'm currently trying to increase the overall efficiency of the Distributed James server.

As such, I'm poking around for improvement areas and found a huge topic around LWT.

My conclusions so far are that we should keep LWT and the SERIAL consistency level out of the most common use cases.

I know that this is a massive change with regard to the way the project has been working with Cassandra in the past few years. In the middle term, I would definitely like to reach LWT-free reads on the Cassandra Mailbox in order to scale the deployments I am responsible for as part of my Linagora job (my long-term goal being to decrease the total cost of ownership of a "Distributed James" based solution). While I am not opposed to diverging from the Apache James project on this point, if needed, I do believe an efficient distributed server (with the consequences it implies in terms of eventual consistency) might be a strong asset for the Apache project as well, and I would prefer to see this work landing on the James project.

I've been ambitious in the ADR writing, especially in the complementary work section. Let's see what consensual ground we find on that! (The ML version here below serves as a public, immutable reference of my thinking.)

Cheers,

Benoit

-------------------------------------------------------------------

## Context

As any kind of server, James needs to provide some level of consistency.

Strong consistency can be achieved with Cassandra by relying on lightweight transactions (LWT). These enable optimistic transactions on a single partition key.

Under the hood, Cassandra relies on the Paxos algorithm to achieve consensus across replicas, allowing us to achieve linearizable consistency at the entry level. To do so, Cassandra tracks consensus in a `system.paxos` table.
This `system.paxos` table needs to be checked upon reads as well, in order to ensure the latest state of the ongoing consensus is known. This can be achieved by using the SERIAL consistency level.

Experiments on a distributed James cluster (4 James nodes with 4 CPUs and 8 GB of RAM each, and a 3-node Cassandra cluster with 32 GB of RAM, 8 CPUs, and SSD disks) demonstrated that the `system.paxos` table was by far the most read and compacted table (ratio 5).

The table triggering the most reads to the `system.paxos` table was the `acl` table. Deactivating LWT on this table alone (lightweight transactions & SERIAL consistency level) enabled an instant 80% throughput gain and latency reductions, as well as softer degradation when the load breaking point is exceeded.

## Decision

Rely on `event sourcing` to maintain a projection of ACLs that does not rely on LWT or the SERIAL consistency level.

Event sourcing is thus responsible for handling concurrency and race conditions, as well as governing denormalization for ACLs. It can be used as a source of truth to re-build ACL projections.

Note that the ACL projection tables can end up being out of synchronization with the aggregate, but we still have a non-questionable source of truth handled via event sourcing.

## Consequences

We expect better load handling, better response times, and cheaper operation costs for Distributed James, while not compromising the data safety of ACL operations.

ACL updates being a rare operation, we do not expect significant degradation of write performance by relying on `eventSourcing`.

We need to implement a corrective task to fix the ACL denormalization projections. Applicative read repairs could be implemented as well, offering both diagnostics and on-the-fly corrections without admin actions (a low probability should however be used, as loading an event sourcing aggregate is not a cheap thing).
## Complementary work

There are several other places where we rely on lightweight transactions in the Cassandra code base and that we might want to challenge:

- `users`: we rely on LWT for throwing "AlreadyExist" exceptions. LWT are likely unnecessary as the webadmin presentation layer offers an idempotent API (and silences the AlreadyExist exceptions). Only the CLI (soon to be deprecated for Guice products) makes this distinction. Discussions have started on the topic and a proof of concept is available.
- `domains`: we rely on LWT for throwing "AlreadyExist" exceptions. LWT are likely unnecessary as the webadmin presentation layer offers an idempotent API (and silences the AlreadyExist exceptions). Only the CLI (soon to be deprecated for Guice products) makes this distinction. Discussions have started on the topic and a proof of concept is available.
- `mailboxes`: relies on LWT to enforce name uniqueness. We hit the same pitfalls as for ACLs, as this is a very often read table (however, mailboxes of a given user being grouped together, primary key reads are more limited, hence this is less critical). Similar results could be expected. Discussions on this topic have not been started yet. Further impact studies on performance need to be conducted.
- `messages`: flag updates are so far transactional. However, by better relying on the table structure used to store flags, we could rely on Cassandra to solve data races for us. Note also that the IMAP CONDSTORE extension is not implemented, and might be a non-viable option performance-wise. We might choose to favor performance over transactionality on this topic. Discussions on this topic have not started yet.

LWT are required for `eventSourcing`. As event sourcing usage is limited to low-usage use cases, the performance degradations are not an issue.

LWT usage is required to generate `UIDs`.
As append message operations tend to be limited compared to message update operations, this is likely less critical. UID generation could be handled via alternative systems; past implementations have been conducted on ZooKeeper.

If not implementing IMAP CONDSTORE, generation of IMAP `MODSEQ` likely no longer makes sense. As such, the fate of `MODSEQ` is linked to decisions on the `message` topic.

Similarly, LWT are used to try to keep the count of emails in MailRepository synchronized. Such a usage is not performance-critical for an MDA (Mail Delivery Agent) use case but might have a bigger impact for an MTA (Mail Transfer Agent). No discussion nor work has been started on the topic.

Other usages of LWT include Sieve script management, initialization of the RabbitMQMailQueue browse start, and other low-impact use cases.

## References

* [Original pull request exploring the topic](https://github.com/apache/james-project/pull/255): `JAMES-3435 Cassandra: No longer rely on LWT for domain and users`
* [JIRA ticket](https://issues.apache.org/jira/browse/JAMES-3435)
* [Pull request abandoning LWT on reads for mailbox ACL](https://github.com/linagora/james-project/pull/4103)
* [ADR-42 Applicative read repairs](https://github.com/apache/james-project/blob/master/src/adr/0042-applicative-read-repairs.md)
* [ADR-21 ACL inconsistencies](https://github.com/apache/james-project/blob/master/src/adr/0021-cassandra-acl-inconsistency.md)
* [Buggy IMAP CONDSTORE](https://issues.apache.org/jira/browse/JAMES-2055)
* [Link to the Mailing list thread discussing this ADR](LINK TO BE INCLUDED)

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
For additional commands, e-mail: server-dev-h...@james.apache.org