[
https://issues.apache.org/jira/browse/JAMES-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310397#comment-17310397
]
Benoit Tellier commented on JAMES-3435:
---------------------------------------
I had an exchange on this topic with Ilja Weis.
He pointed me to the following links:
https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/cassandra-3.11.10
{code:java}
- This release fix a correctness issue with SERIAL reads, and LWT writes that do not apply.
  Unfortunately, this fix has a performance impact on read performance at the SERIAL or
  LOCAL_SERIAL consistency levels. For heavy users of such SERIAL reads, the performance
  impact may be noticeable and may also result in an increased of timeouts. For that
  reason, a opt-in system property has been added to disable the fix:
      -Dcassandra.unsafe.disable-serial-reads-linearizability=true
  Use this flag at your own risk as it revert SERIAL reads to the incorrect behavior of
  previous versions. See CASSANDRA-12126 for details.
{code}
More details are provided here:
https://issues.apache.org/jira/browse/CASSANDRA-12126
In short, the Paxos setup for lightweight transactions (LWT) requires committing
an empty update on each read, to be sure not to miss in-flight updates (in some
complex distributed-failure edge cases). This results in a performance hit: some
users report SERIAL read timeouts under high SMTP load.
Please note that James relies on linearizability, and not achieving these
guarantees will lead to message loss (UIDs being per-mailbox monotonic integers,
relaxing consistency might lead to message overwrites).
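To make the overwrite risk concrete, here is a minimal in-memory sketch (plain Java, no Cassandra involved). The class `UidCounterSketch` and its methods are hypothetical names used for illustration only, not James code; `compareAndSet` merely stands in for a conditional write such as an LWT.

```java
/** In-memory stand-in for a per-mailbox "last UID" row. Hypothetical, not James code. */
class UidCounterSketch {
    private long lastUid = 0;

    /** Plain read of the last allocated UID. */
    long read() {
        return lastUid;
    }

    /** Unconditional write: whoever writes last wins. */
    void blindWrite(long newUid) {
        lastUid = newUid;
    }

    /** Conditional write, standing in for an "update only if unchanged" LWT. */
    boolean compareAndSet(long expected, long newUid) {
        if (lastUid != expected) {
            return false;
        }
        lastUid = newUid;
        return true;
    }

    /** Allocates the next UID from a possibly stale read, retrying until the write applies. */
    long allocate(long observed) {
        long current = observed;
        while (!compareAndSet(current, current + 1)) {
            current = read(); // our view was stale: re-read and retry
        }
        return current + 1;
    }

    /** Two writers read the same value, then both write blindly. */
    static long[] racyAllocation() {
        UidCounterSketch counter = new UidCounterSketch();
        long seenByA = counter.read();
        long seenByB = counter.read();
        counter.blindWrite(seenByA + 1);
        counter.blindWrite(seenByB + 1);
        return new long[] {seenByA + 1, seenByB + 1};
    }

    /** Same interleaving, but each writer goes through the conditional write. */
    static long[] casAllocation() {
        UidCounterSketch counter = new UidCounterSketch();
        long seenByA = counter.read();
        long seenByB = counter.read();
        long uidA = counter.allocate(seenByA);
        long uidB = counter.allocate(seenByB);
        return new long[] {uidA, uidB};
    }

    public static void main(String[] args) {
        long[] racy = racyAllocation();
        long[] safe = casAllocation();
        System.out.println("blind writes: " + racy[0] + ", " + racy[1]);
        System.out.println("conditional writes: " + safe[0] + ", " + safe[1]);
    }
}
```

With blind writes both writers end up holding the same UID, so the second message would overwrite the first. With the conditional write the stale writer is rejected and retries.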
I am convinced that we do not need SERIAL consistency for regular reads; we just
need it for reads performed as part of write transactions. We can thus reduce
the SERIAL read workload significantly.
Then, I wonder whether we should give people more choices. For instance, regarding
flag updates, I would personally consider lost updates acceptable, as they would
only be an inconvenience to the end user, who might have to mark a mail as read
again. I would be glad to experiment with a non-LWT-dependent flag update.
(Experiments suggest that MODSEQ allocation acts as a sequencer that limits
concurrency upon flag updates - I was surprised that 20 updates conducted on
4 threads would lead to inconsistent results only 25% of the time...)
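For reference, the kind of lost update at stake can be reproduced with a deterministic interleaving. This is only a sketch: `FlagLostUpdateSketch` is a hypothetical in-memory model of a read-before-write flag update, not the James schema or its flag-update code.

```java
import java.util.EnumSet;

/** In-memory model of read-before-write flag updates. Hypothetical, not James code. */
class FlagLostUpdateSketch {
    enum Flag { SEEN, FLAGGED }

    // Flags stored for a single message.
    private EnumSet<Flag> stored = EnumSet.noneOf(Flag.class);

    /** Returns a copy of the stored flags, as a client would read them. */
    EnumSet<Flag> read() {
        return EnumSet.copyOf(stored);
    }

    /** Unconditional overwrite of the whole flag set (no LWT condition). */
    void write(EnumSet<Flag> flags) {
        stored = EnumSet.copyOf(flags);
    }

    /** Two clients read the same state, then each writes back its own modified copy. */
    static EnumSet<Flag> interleavedUpdates() {
        FlagLostUpdateSketch message = new FlagLostUpdateSketch();
        EnumSet<Flag> viewA = message.read();
        EnumSet<Flag> viewB = message.read();
        viewA.add(Flag.SEEN);
        message.write(viewA);
        viewB.add(Flag.FLAGGED);
        message.write(viewB); // overwrites A's update: SEEN is lost
        return message.read();
    }

    public static void main(String[] args) {
        System.out.println("final flags: " + interleavedUpdates());
    }
}
```

The message itself is untouched; only one of the two flag changes is dropped, which the user can redo.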
I would advocate finer-grained control of LWT usage through configuration
properties. Namely:
{code:java}
| mailbox.read.strong.consistency
| Optional. Boolean, defaults to true. Disabling should be considered experimental.
If disabled, the regular consistency level is used for mailbox read transactions
instead of SERIAL. Doing so might result in stale reads, as the system.paxos table
will not be checked for the latest updates. Better performance is expected when
turning it off. Note that reads performed as part of write transactions are always
performed with strong consistency.
| message.read.strong.consistency
| Optional. Boolean, defaults to true. Disabling should be considered experimental.
If disabled, the regular consistency level is used for message read transactions
instead of SERIAL. Doing so might result in stale reads, as the system.paxos table
will not be checked for the latest updates. Better performance is expected when
turning it off. Note that reads performed as part of write transactions are always
performed with strong consistency.
| message.write.strong.consistency.unsafe
| Optional. Boolean, defaults to true. Disabling should be considered experimental
and unsafe.
If disabled, lightweight transactions will no longer be used for message operations
(table `imapUidTable`). As message flag updates so far rely on a read-before-write
model, this exposes you to data races potentially leading to lost updates. Better
performance is expected when turning it off. Reads performed as part of write
transactions are then also performed with relaxed consistency.
{code}
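As an illustration of how these three properties could drive the choice of read consistency, here is a hedged sketch. The class `LwtConfigurationSketch`, the `ReadConsistency` enum and the method names are assumptions made for this example and do not reflect existing James classes. It also assumes that `true` means SERIAL reads are kept.

```java
import java.util.Properties;

/** Sketch of reading the proposed properties. Names other than the property keys are hypothetical. */
class LwtConfigurationSketch {
    enum ReadConsistency { REGULAR, SERIAL }

    final boolean mailboxReadStrong;
    final boolean messageReadStrong;
    final boolean messageWriteStrong;

    LwtConfigurationSketch(Properties props) {
        mailboxReadStrong = flag(props, "mailbox.read.strong.consistency");
        messageReadStrong = flag(props, "message.read.strong.consistency");
        messageWriteStrong = flag(props, "message.write.strong.consistency.unsafe");
    }

    /** All three properties are optional booleans defaulting to true. */
    private static boolean flag(Properties props, String key) {
        return Boolean.parseBoolean(props.getProperty(key, "true").trim());
    }

    /** Mailbox reads: always SERIAL inside a write transaction, configurable otherwise. */
    ReadConsistency mailboxRead(boolean partOfWriteTransaction) {
        return (partOfWriteTransaction || mailboxReadStrong)
            ? ReadConsistency.SERIAL
            : ReadConsistency.REGULAR;
    }

    /** Message reads: reads inside a write transaction follow the write setting. */
    ReadConsistency messageRead(boolean partOfWriteTransaction) {
        boolean strong = partOfWriteTransaction ? messageWriteStrong : messageReadStrong;
        return strong ? ReadConsistency.SERIAL : ReadConsistency.REGULAR;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("mailbox.read.strong.consistency", "false");
        LwtConfigurationSketch config = new LwtConfigurationSketch(props);
        System.out.println("mailbox regular read: " + config.mailboxRead(false));
        System.out.println("mailbox read in write transaction: " + config.mailboxRead(true));
        System.out.println("message regular read: " + config.messageRead(false));
    }
}
```

Defaulting every flag to `true` keeps today's behaviour unless an administrator explicitly opts out.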
I volunteer to contribute this in a timely manner...
Also, alternative technologies might need to be explored to generate monotonic
counters: UID & MODSEQ (see RFC-3501) are not included in the above proposal, as
correctness is required.
- Discussing this very topic with @mbaechler: https://atomix.io/ might be a
nice candidate, with out-of-the-box support for atomic counters, but work would
be needed on the cluster membership side. A good option might be standalone
atomix agents
https://atomix.io/docs/latest/user-manual/deployment/kubernetes/... I still
have questions regarding persistence... I started a thread on their gitter.
- Historically, the project did have code to handle UID and MODSEQ generation
through ZooKeeper, but it was unmaintained and was removed.
- Could consul be a candidate? https://www.consul.io/api-docs/kv &
https://www.consul.io/api/features/consistency (consistent)
I think contributions should be welcomed on the monotonic integer topic to
provide alternatives to Cassandra LWT.
Once this is deployed, we should notice a sharp drop in LWT usage, less
activity on the system.paxos table, as well as a decrease in CPU usage on the
Cassandra cluster.
> Relaxing LWT usage: domain, users
> ---------------------------------
>
> Key: JAMES-3435
> URL: https://issues.apache.org/jira/browse/JAMES-3435
> Project: James Server
> Issue Type: Improvement
> Components: cassandra
> Affects Versions: master
> Reporter: Benoit Tellier
> Priority: Major
> Fix For: master
>
>
> https://www.mail-archive.com/[email protected]/msg68713.html
> {code:java}
> Cassandra is an eventually consistent datastore that can be used in a
> consistent fashion. To do so, we rely on a mechanism called "LightWeight
> Transactions (LWT)". Lightweight transactions rely on the PAXOS
> distributed consensus algorithm to enforce a condition upon data
> mutation. A table, system.paxos, is used to track the state of
> transactions. Furthermore, upon writes, several round-trips (two) are
> needed to ensure data integrity across replicas (minimum round trips to
> achieve consensus) and the system.paxos table is read / written to in
> addition to the applicative table.
> All of this causes LWT to be significantly slower than their lower
> consistency counterparts. On some Linagora owned production instances,
> regular reads take 2ms while reads on tables relying on LWT take 6ms.
> Similar figures are found for writes. We also noticed some high
> compaction throughput on the paxos table, leading to many background
> writes.
> Given the massive impact of LWT usage on performance, and given the lack
> of debate upon LWT adoption, I would like to re-challenge their usage...
> Here are the places we rely on LWT for the Distributed Server:
> - IMAP UID generation (monotonic integer) - strong consistency is
> strictly required to not lose data, as overwriting a UID means
> overwriting a message.
> - IMAP ModSeq generation (monotonic integer) - strong consistency is
> required, as modseq overwrites can lead to some data not being well
> synchronised.
> - Domain and users - we rely on LWT to return an error when deleting a
> user that does not exist, or creating an already existing user. It sounds
> unnecessary.
> - Message flags rely on LWT to ensure updates are not overwritten. As
> an often read metadata, the impact is high, for limited criticality for
> the end user. After all, no data is lost, only a user action like
> marking a message as Seen, an action that they can very well perform again.
> - Mailbox path registration, ACL - required to prevent data races
> My proposal would be:
> - Keep using LWT for UID and modseq generation, as well as Mailbox path
> registration.
> - Make the use of LWT for message flags update configurable - as an
> admin I can choose to disable it.
> - I am also fine with completely removing LWT usage for message flags
> update.
> - No longer use LWT on domain or users. Instead use idempotent create /
> delete. The contract test will thus need to be relaxed.
> - In the long term, relying on a CRDT to represent ACLs at the
> Cassandra level, instead of serialized JSON, would enable us to get rid of
> LWT usage on the ACL table.
> {code}
> Let's start relaxing LWT transaction for users & domains.