[
https://issues.apache.org/jira/browse/JAMES-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061861#comment-18061861
]
Jean Helou commented on JAMES-4182:
-----------------------------------
Having compression sounds like a great idea!
The technical part of the proposal is not so clear to me; in the end I'm not
sure what exactly you propose to implement.
As I understand it, the problem can be expressed as this set of constraints:
* Compression is relevant for all BlobStore implementations
* One might argue it's even more important for Cassandra or SQL storage, since
storage prices for such systems are usually higher than for S3
* We want to compress before we encrypt
* Encryption generates random or pseudo-random byte streams that don't
compress well
* We want to be able to enable compression on existing data stores, which means
preserving backward-compatible reads of uncompressed data
* This was not implemented at the time for the AESBlobStoreDAO (as far as
I know we don't have a fallback mechanism, so it only works for previously empty
blob stores)
From a technical perspective:
It sounds like we could have a ZstdBlobStoreDAO which composes with an
underlying BlobStoreDAO, just like AESBlobStoreDAO does; the configuration
layer would then compose the desired stack of DAOs.
This satisfies both the first and second constraints.
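For illustration, such a composing DAO could look like the following sketch. It assumes a deliberately simplified, synchronous store interface (the real James BlobStoreDAO is reactive), and uses GZIP from the JDK as a stand-in for Zstd, which would require an extra dependency such as zstd-jni; all names here are illustrative, not the real James APIs.

```java
import java.io.*;
import java.util.*;
import java.util.zip.*;

// Hypothetical simplified store interface; the real James BlobStoreDAO is reactive.
interface SimpleBlobStore {
    void save(String bucket, String blobId, byte[] data) throws IOException;
    byte[] read(String bucket, String blobId) throws IOException;
}

// In-memory stand-in for the underlying DAO (S3, Cassandra, ...).
class MemoryBlobStore implements SimpleBlobStore {
    private final Map<String, byte[]> blobs = new HashMap<>();
    public void save(String bucket, String blobId, byte[] data) {
        blobs.put(bucket + "/" + blobId, data);
    }
    public byte[] read(String bucket, String blobId) {
        return blobs.get(bucket + "/" + blobId);
    }
}

// Decorator that compresses on write and decompresses on read,
// mirroring how AESBlobStoreDAO composes over an underlying DAO.
class CompressingBlobStore implements SimpleBlobStore {
    private final SimpleBlobStore underlying;
    CompressingBlobStore(SimpleBlobStore underlying) { this.underlying = underlying; }

    public void save(String bucket, String blobId, byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) { gzip.write(data); }
        underlying.save(bucket, blobId, out.toByteArray());
    }

    public byte[] read(String bucket, String blobId) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(
                new ByteArrayInputStream(underlying.read(bucket, blobId)))) {
            return gzip.readAllBytes();
        }
    }
}
```

The configuration layer would then stack decorators in the right order (compression below encryption, so that payloads are compressed before being encrypted).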
Which leaves the migration strategy choice:
- migration script
- or backward read compatibility
h3. Migration script
I think the migration script can be written using existing James components,
similar to how I wrote the JPA to PostgreSQL migration script, including idempotency
(the ability to run several times on the same data, possibly skipping work on data
which is already present in the target "bucket").
A migration script is tempting; an operator could:
- create a new bucket for each bucket they want to apply compression to,
- run the migration program over all the objects of the uncompressed bucket,
applying compression to the objects but preserving their blob ids,
- once the migration is over, change the server configuration and start reading
the compressed data.
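The steps above could be sketched as follows, with in-memory maps standing in for the source and target buckets and GZIP standing in for Zstd; this is a sketch under those assumptions, not the real James migration machinery.

```java
import java.io.*;
import java.util.*;
import java.util.zip.*;

// Illustrative migration: copy every blob of the uncompressed source bucket into
// a new target bucket, compressing the payload but preserving the blob id.
// Idempotent: blobs already present in the target bucket are skipped on re-runs.
class CompressionMigration {
    static void migrate(Map<String, byte[]> source, Map<String, byte[]> target) throws IOException {
        for (Map.Entry<String, byte[]> blob : source.entrySet()) {
            if (target.containsKey(blob.getKey())) {
                continue; // already migrated: skip work on a second run
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
                gzip.write(blob.getValue());
            }
            target.put(blob.getKey(), out.toByteArray());
        }
    }
}
```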
This has a couple of issues to consider:
- do we plan for the script to be able to decrypt existing encrypted content,
then compress and re-encrypt it before writing to the new bucket?
- how do we handle writes that occur while the migration runs (additions and
deletions)?
- can this strategy apply to non-S3 stores?
h3. Backward read compatibility
Somehow encode the compression status when writing the data. When reading, look
for the compression header:
- if the header is missing, read without compression
- if the header is present but decompression fails, read without compression
(assume the data was actually uncompressed)
- if the header is present, return the decompressed data
This avoids the need to change the blob store DAO signatures (which would be a
lot of work); we can make underlying implementations aware of the configuration
and have them use store-specific metadata to document whether a specific record
is compressed (to help with non-James tooling, for example).
In such a scenario we could even imagine a migration script which migrates data
in place (but risks some transient read errors if the corresponding James
server(s) remain live during the migration, since reads and writes are not
necessarily atomic, especially on an S3 store).
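The header-based fallback read could be sketched like this, again with GZIP from the JDK standing in for Zstd and a made-up magic marker; a real Zstd implementation could instead rely on Zstd's own frame magic number (0xFD2FB528) rather than a custom header.

```java
import java.io.*;
import java.util.*;
import java.util.zip.*;

// Illustrative backward-compatible read: compressed payloads are prefixed with a
// magic header; anything without the header is served as-is.
class FallbackReader {
    static final byte[] MAGIC = {'Z', 'S', 'T', 'D'}; // hypothetical marker

    // Writing path: prefix the marker, then the compressed payload.
    static byte[] write(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(MAGIC);
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) { gzip.write(data); }
        return out.toByteArray();
    }

    // Reading path implementing the three fallback rules from the comment.
    static byte[] read(byte[] stored) {
        if (stored.length < MAGIC.length
                || !Arrays.equals(Arrays.copyOf(stored, MAGIC.length), MAGIC)) {
            return stored; // header missing: legacy uncompressed blob
        }
        try (GZIPInputStream gzip = new GZIPInputStream(
                new ByteArrayInputStream(stored, MAGIC.length, stored.length - MAGIC.length))) {
            return gzip.readAllBytes();
        } catch (IOException e) {
            return stored; // header present but decompression failed: assume uncompressed
        }
    }
}
```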
> S3 object compression
> ---------------------
>
> Key: JAMES-4182
> URL: https://issues.apache.org/jira/browse/JAMES-4182
> Project: James Server
> Issue Type: Improvement
> Components: s3
> Reporter: Benoit Tellier
> Priority: Major
>
> h3. Why?
> As a James operator I want to save money on my S3 cloud bill.
> Operating a mail solution, I notice that 50% of the cloud cost is S3 storage.
> Garbage collection + data tiering is in place (attachment tiering for what s3
> is concerned) but I want to have further mechanisms at hand to reduce the
> bill.
> On the gains side:
> - Attachment payload is mostly compressed - this is mitigated by attachment
> tiering
> - Mime is base 64 encoded - we can expect a minimum compression ratio of 30%
> - On a LINAGORA workload I measured a compression ratio of ~0.55
> h3. What?
> Optional ZSTD compression in James ObjectStores.
> I wish for a fully retro-compatible mechanism that only uncompresses compressed
> data, ideally using `content-encoding` metadata on the s3 object as a
> compression marker.
> I also wish for:
> - A size threshold (16KB by default)
> - A minimum compression ratio (defaulting to 1 - only compress if meaningful)
> In `blob.properties`:
> {code:java}
> compression.enabled=true
> compression.size.threshold=16K
> compression.min.ratio=0.8
> {code}
> I propose to implement this directly in the S3BlobStoreDAO. Sadly the
> BlobStoreDAO abstraction misses the metadata concept needed to
> know whether the data has been compressed or not.
> h3. Bringing the design even further
> What I will actually *really* do is:
> - never compress in James
> - leverage external Java code to compress data of older generations (> 1
> month)
> - and have James transparently serve older compressed data
> With this:
> - Most of the data is stored compressed - massive storage gains.
> - Data compression is fully asynchronous
> - 90% minimum of the read traffic is served uncompressed
> h3. Alternatives
> The most controversial part of the proposal is not doing this by composing
> the BlobStoreDAO - the abstraction actually misses a `metadata` concept to do this.
> We *could* bring this metadata (Map<String, String>) to the BlobStoreDAO
> I would propose something like this for the data model:
> {code:java}
> sealed interface Blob {
> Map<String, String> metadata();
> // Have the POJOs encode some conversions ?
> InputStreamBlob payloadAsStream();
> BytesBlob asBytes();
> ByteSourceBlob asByteSource();
> }
> record BytesBlob {...}
> record InputStreamBlob {...}
> record ByteSourceBlob {...}
> record StringBlob {...}
> record ByteFluxBlob {...} // Flux<ByteBuffer>
> {code}
> We could then refine the BlobStoreDAO interface:
> {code:java}
> public interface BlobStoreDAO {
> // implementations to pattern match on data!
> Publisher<Void> save(BucketName bucketName, BlobId blobId, Blob data);
> Publisher<BytesBlob> readBytes(BucketName bucketName, BlobId blobId);
> Publisher<InputStreamBlob> readReactive(BucketName bucketName, BlobId
> blobId);
> InputStreamBlob read(BucketName bucketName, BlobId blobId);
> // delete* + list* methods unchanged
> }
> {code}
> Please note that implementations that do not support metadata (file, Cassandra)
> shall THROW.
> Upsides:
> - independent from S3: we do not make the S3 code more complex, we compose
> over it
> - independent from S3: we could actually reuse this for other blob stores
> (if any)
> - benefit for encryption: to benefit from compression, we need to
> compress then encrypt; encrypt-then-compress yields zero benefit. By making
> compression an S3 concern we would be forced to encrypt then compress.
> Downside: major refactoring needed...
> Opinions?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)