[
https://issues.apache.org/jira/browse/JAMES-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benoit Tellier updated JAMES-3430:
----------------------------------
Attachment: Capture d’écran de 2020-10-20 12-11-19.png
> Reduce Cassandra overall load by restructuring messageV2 table
> --------------------------------------------------------------
>
> Key: JAMES-3430
> URL: https://issues.apache.org/jira/browse/JAMES-3430
> Project: James Server
> Issue Type: Improvement
> Components: cassandra, mailbox
> Affects Versions: master, 3.6.0
> Reporter: Benoit Tellier
> Priority: Major
> Fix For: master
>
> Attachments: Capture d’écran de 2020-10-20 12-11-19.png
>
>
> *Analysis*
> Today messagev2 metadata table is the largest table of our schema (11GB on 6
> million messages instalation).
> Reads to it are abnormally long, and this is the query we spend the more time
> executing while doing IMAP:
> (see attached screenshot)
> That is by far our most expensive query (per row), and that while LWT do not
> even come into play!
> Looking at table stats:
> {code:java}
> Number of partitions (estimate): 6100310
> SSTable Compression Ratio: 0.43905338269472705
> Space used (total): 2037635717
> {code}
> Doing a nodetool status, we realize that total occupied space in Cassandra is
> of `4.97 GiB`, so messagesv2 table occuppy 40% of total storage space which
> is a lot compared to other messages metadata tables (~180MB so 10x less).
> Doing a tablehistograms analysis:
> {code:java}
> Percentile SSTables Write Latency Read Latency Partition Size
> Cell Count
> (micros) (micros) (bytes)
>
> 50% 1.00 0.00 315.85 770
> 12
> 75% 1.00 0.00 545.79 770
> 12
> 95% 1.00 0.00 1131.75 1109
> 17
> 98% 1.00 0.00 1629.72 1331
> 17
> 99% 1.00 0.00 1955.67 1597
> 20
> Min 0.00 0.00 88.15 30
> 0
> Max 1.00 0.00 3379.39 1131752
> 770
> {code}
> We realize that cell count is high, and that the byte count is high (other
> message metadata table are between 125 -> 250 bytes so 6 times less).
> Knowing our data model, each message have a set of properties, each composed
> of a namespace, a name and a value. These are stored as a UDT list, thus is
> really space inefficient. Compress ratio (see above) do not compensate for
> this.
> These properties are well defined, set by the StoreMailboxManager and only
> include:
> - mediaType
> - subType
> - ContentId
> - ContentDescription
> - ContentTrasferEncoding
> - ContentLocation
> - ContentDispositionType
> - ContentDispositionParameters
> - ContentTypeParameters
> - ContentMD5
> - ContentLanguage
> - Charset
> - MimepartDelimites
> *Why?*
> Here is the conclusion:
> - Slow reads on messagev2 table have a large impact on IMAP performance (and
> on Cassandra performances)
> - This slowness is due to the corresponding space used on disk (more data =
> slower reads)
> - This extra space is due to an inefficient storage format of the property
> fields
> - By restructuring the way we store these properties we can reclaim disk
> space and thus query read speed
> *How?*
> - Remove unused properties: see
> https://github.com/linagora/james-project/pull/3925 (charset and mime part
> delimiters are unused)
> - Restructure CassandraMessageDAO (and underlying table) to store known
> properties in a column instead of a UDT list. Unknown properties shall be
> rejected.
> This avoids the use of a collection on a critical table thus should
> significantly fasten related operations.
> A data migration (messagev3) will be needed.
> *Definition of done*
> - Come up with space occupied per message update within tests for the old
> and the new
> - IMAP performance tests
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]