[ 
https://issues.apache.org/jira/browse/JAMES-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061861#comment-18061861
 ] 

Jean Helou commented on JAMES-4182:
-----------------------------------

Having compression sounds like a great idea!

The technical part of the proposal is not entirely clear to me though: in the 
end, I am not sure exactly what you propose to implement.

As I understand it, the problem can be expressed as this set of constraints:
 * Compression is relevant for all BlobStore implementations
  * One might argue that it is even more important for C* or SQL storage, since 
storage prices for such systems are usually higher than for S3
 * We want to compress before we encrypt
  * Encryption generates random or pseudo-random byte streams that do not 
compress well
 * We want to be able to enable compression on existing data stores, which means 
preserving backward-compatible reads of uncompressed data
    * This was not implemented at the time for the AESBlobStoreDAO (as far as 
I know we don't have a fallback mechanism, so it only works for previously empty 
blob stores)

From a technical perspective:

It sounds like we could have a ZstdBlobStoreDAO which composes with an 
underlying BlobStoreDAO, just like AESBlobStoreDAO; the configuration layer 
would then compose the desired stack of DAOs.
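A minimal sketch of what such a composing DAO could look like, with a 
simplified blocking interface and Deflate standing in for Zstd (all names here 
are hypothetical; the real BlobStoreDAO is reactive and Publisher-based):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical simplified interface, standing in for the real reactive BlobStoreDAO.
interface SimpleBlobStoreDAO {
    void save(String bucketName, String blobId, byte[] data);
    byte[] read(String bucketName, String blobId);
}

// Decorator: compresses on write, decompresses on read, delegates storage.
// Since AESBlobStoreDAO composes the same way, stacking the compressing DAO
// outside the encrypting one yields compress-then-encrypt on the write path.
class CompressingBlobStoreDAO implements SimpleBlobStoreDAO {
    private final SimpleBlobStoreDAO underlying;

    CompressingBlobStoreDAO(SimpleBlobStoreDAO underlying) {
        this.underlying = underlying;
    }

    @Override
    public void save(String bucketName, String blobId, byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        underlying.save(bucketName, blobId, buffer.toByteArray());
    }

    @Override
    public byte[] read(String bucketName, String blobId) {
        byte[] stored = underlying.read(bucketName, blobId);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InflaterInputStream in = new InflaterInputStream(new ByteArrayInputStream(stored))) {
            in.transferTo(buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```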

This satisfies both the first and second constraints, which leaves the choice 
of migration strategy:
 - migration script
 - or backward read compatibility

h3. Migration script


I think the migration script can be written using existing James components, 
similar to how I wrote the JPA to PG migration script, including idempotency 
(the ability to run several times on the same data, possibly skipping work on 
data which is already present in the target "bucket").

A migration script is tempting; an operator could:
 - Create a new bucket for each bucket they want to apply compression to,
 - Run the migration program on all the objects of the uncompressed bucket, 
applying compression to the objects while preserving their blobIds,
 - Once the migration is over, change the server configuration and start reading 
the compressed data.
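The second step, including the idempotency requirement, could be sketched 
roughly like this (Store is a hypothetical, simplified stand-in for the James 
blob store components, which are reactive in reality):

```java
import java.util.List;
import java.util.function.Function;

class CompressionMigration {
    // Hypothetical blocking abstraction over a blob store bucket.
    interface Store {
        List<String> listBlobIds(String bucketName);
        boolean exists(String bucketName, String blobId);
        byte[] read(String bucketName, String blobId);
        void save(String bucketName, String blobId, byte[] data);
    }

    // Copies every blob from sourceBucket to targetBucket, compressing on the
    // way and preserving blobIds. BlobIds already present in the target are
    // skipped, so re-running after a crash resumes where it left off.
    static int migrate(Store store, String sourceBucket, String targetBucket,
                       Function<byte[], byte[]> compress) {
        int migrated = 0;
        for (String blobId : store.listBlobIds(sourceBucket)) {
            if (store.exists(targetBucket, blobId)) {
                continue; // already migrated on a previous run
            }
            store.save(targetBucket, blobId, compress.apply(store.read(sourceBucket, blobId)));
            migrated++;
        }
        return migrated;
    }
}
```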

This has a couple of issues to consider:
 - do we expect the script to be able to decrypt existing encrypted content, 
then compress and re-encrypt it before writing to the new bucket?
 - how do we handle writes that occur while the migration runs (additions and 
deletions)?
 - can this strategy apply to non-S3 stores?

h3. Backward read compatibility

Somehow encode the compression status when writing the data. When reading, look 
for the compression header:
 - if the header is missing, read without decompression
 - if the header is present but decompression fails, read without 
decompression (assume the data was actually uncompressed)
 - if the header is present, return the decompressed data
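A sketch of that fallback read, using the Zstd frame magic number (0xFD2FB528, 
stored little-endian per RFC 8878) as the compression header; the decompress 
function stands in for an actual Zstd codec (e.g. zstd-jni), which is not shown 
here:

```java
import java.util.function.Function;

class FallbackReader {
    // Zstd frame magic number 0xFD2FB528, as it appears on the wire (little-endian).
    private static final byte[] ZSTD_MAGIC = {(byte) 0x28, (byte) 0xB5, (byte) 0x2F, (byte) 0xFD};

    static boolean hasZstdMagic(byte[] stored) {
        if (stored.length < ZSTD_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < ZSTD_MAGIC.length; i++) {
            if (stored[i] != ZSTD_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    static byte[] read(byte[] stored, Function<byte[], byte[]> decompress) {
        if (!hasZstdMagic(stored)) {
            return stored; // header missing: legacy uncompressed blob
        }
        try {
            return decompress.apply(stored); // header present: decompress
        } catch (RuntimeException e) {
            return stored; // decompression failed: assume the data was raw after all
        }
    }
}
```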

This avoids the need to change the BlobStoreDAO signatures (which would be a 
lot of work); we can make the underlying implementations aware of the 
configuration and have them use store-specific metadata to record whether a 
specific record is compressed (to help with non-James tooling, for example).
In such a scenario we could even imagine a migration script which migrates data 
in place (at the risk of some transient read errors if the corresponding James 
server(s) remain live during the migration, since reads and writes are not 
necessarily atomic, especially on an S3 store).

> S3 object compression
> ---------------------
>
>                 Key: JAMES-4182
>                 URL: https://issues.apache.org/jira/browse/JAMES-4182
>             Project: James Server
>          Issue Type: Improvement
>          Components: s3
>            Reporter: Benoit Tellier
>            Priority: Major
>
> h3. Why?
> As a James operator I want to save money on my S3 cloud bill.
> Operating a mail solution, I notice 50% of the cloud cost is S3 storage.
> Garbage collection + data tiering is in place (attachment tiering as far as S3 
> is concerned) but I want to have further mechanisms at hand to reduce the 
> bill.
> On the gains side:
>  - Attachment payloads are mostly already compressed - this is mitigated by 
> attachment tiering
>  - Mime is base64 encoded - we can expect a minimum compression ratio of 30%
>  - On LINAGORA's workload I observed a compression ratio of ~0.55
> h3. What?
> Optional ZSTD compression in James ObjectStores.
> I wish for a fully retro-compatible mechanism that only uncompresses 
> compressed data, ideally using `content-encoding` metadata on the s3 object 
> as a compression marker.
> Also I wish for:
>  - A size threshold (16KB by default)
>  - A minimum compression ratio (defaulting to 1 - only compress if meaningful)
> In `blob.properties`:
> {code:java}
> compression.enabled=true
> compression.size.threshold=16K
> compression.min.ratio=0.8
> {code}
> I propose to implement this directly onto the S3BlobStoreDAO. Sadly the 
> BlobStoreDAO abstraction lacks the metadata needed to know whether the data 
> has been compressed or not.
> h3. Bringing the design even further
> What I will actually *really* do is:
>  - never compress in James
>  - leverage external Java code to compress data of older generations (> 1 
> month)
>  - and have James transparently serve the older compressed data
> With this:
>  - Most of the data is stored compressed - massive storage gains.
>  - Data compression is fully asynchronous
>  - 90% minimum of the read traffic is served uncompressed
> h3. Alternatives
> The most controversial part of the proposal is that it does not do this by 
> composing the BlobStoreDAO - I actually lack a `metadata` concept to do so.
> We *could* bring this metadata (Map<String, String>) to the BlobStoreDAO
> I would propose something like this for the data model:
> {code:java}
> sealed interface Blob {
>     Map<String, String> metadata();
>     // Have the POJOs encode some conversions ?
>     InputStreamBlob payloadAsStream();
>     BytesBlob asBytes();
>     ByteSourceBlob asByteSource(); 
> }
> record BytesBlob {...}
> record InputStreamBlob {...}
> record ByteSourceBlob {...}
> record StringBlob {...}
> record ByteFluxBlob {...} // Flux<ByteBuffer>
> {code}
> We could then refine the BlobStoreDAO interface:
> {code:java}
> public interface BlobStoreDAO {
>     // implementations to pattern match on data!
>     Publisher<Void> save(BucketName bucketName, BlobId blobId, Blob data); 
>     Publisher<BytesBlob> readBytes(BucketName bucketName, BlobId blobId);
>     Publisher<InputStreamBlob> readReactive(BucketName bucketName, BlobId 
> blobId);
>     InputStreamBlob read(BucketName bucketName, BlobId blobId);
>     // delete* + list* methods unchanged
> } 
> {code}
> Please note that implementations that do not support metadata (file, 
> Cassandra) shall THROW.
> Upsides:
>  - independent from S3: we do not make the S3 code more complex, we compose 
> over it
>  - independent from S3: we could actually reuse this for other blob stores 
> (if any)
>  - benefit for encryption: to benefit from compression, we need to compress 
> then encrypt; encrypt then compress yields zero benefit. By making compression 
> an S3 concern we would be forced to encrypt then compress.
> Downside: major refactoring needed...
> Opinions?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
