[
https://issues.apache.org/jira/browse/JAMES-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061981#comment-18061981
]
Benoit Tellier commented on JAMES-4182:
---------------------------------------
> Having compression sounds like a great idea !
Thanks for the enthusiasm!
> As I understand it the problem can be expressed as this set of constraints
Correct, with constraints 1 and 2 being 'optional' to me, and not achievable
with the first proposed design, but possible with the second one.
> ZstdBlobStoreDao
To be able to do this well, we need a set of metadata available (in
order to remember that the blob is compressed).
An alternative would be to look for ZSTD headers, but it is not clean: a ZSTD
attachment stored uncompressed (so ZSTD x 1) would be wrongly decompressed when
served - i.e. looking at ZSTD headers alone is not a bijection, which is a
no-go to me.
The hardest point with this design is *augmenting* the BlobStoreDAO interface
to carry metadata alongside the binary data and *remember* whether
decompression is needed.
I'm not sure this is any clearer but at least I tried...
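To make the non-bijection point concrete, here is a minimal sketch of header sniffing (the ZSTD frame magic number is 0x28 0xB5 0x2F 0xFD, little-endian); the class and method names are illustrative, not James code:

```java
import java.util.Arrays;

public class ZstdSniff {
    // ZSTD frame magic number, little-endian: 0x28 0xB5 0x2F 0xFD
    private static final byte[] ZSTD_MAGIC = {(byte) 0x28, (byte) 0xB5, (byte) 0x2F, (byte) 0xFD};

    static boolean looksLikeZstd(byte[] payload) {
        return payload.length >= 4
                && Arrays.equals(Arrays.copyOfRange(payload, 0, 4), ZSTD_MAGIC);
    }

    public static void main(String[] args) {
        // A .zst attachment that James stored *as-is* still starts with the
        // magic bytes: the sniffer cannot tell it apart from a blob James
        // compressed itself - hence the need for real metadata.
        byte[] zstAttachmentStoredVerbatim = {(byte) 0x28, (byte) 0xB5, (byte) 0x2F, (byte) 0xFD, 0x00};
        System.out.println(looksLikeZstd(zstAttachmentStoredVerbatim)); // prints "true": a false positive
    }
}
```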
> This satisfies both the first and second constraints.
The third constraint is the most important to me.
Not just for smooth migration (which at my scale is compulsory),
but also because it is a **feature** that allows including compression in a
**data tiering** strategy, cf
https://james.staged.apache.org/james-project/3.9.0/servers/distributed/architecture/data-tiering.html
Also, compressing incompressible or small data is harmful, and this
BlobStoreDAO would by essence mix compressed and uncompressed data. Having
metadata to remember which is which is a must.
ZstdBlobStoreDAO on top of the existing interface is a NOGO to me.
> S3 object compression
> ---------------------
>
> Key: JAMES-4182
> URL: https://issues.apache.org/jira/browse/JAMES-4182
> Project: James Server
> Issue Type: Improvement
> Components: s3
> Reporter: Benoit Tellier
> Priority: Major
>
> h3. Why?
> As a James operator I want to save money on my S3 cloud bill.
> Operating a mail solution, I notice that 50% of the cloud cost is S3 storage.
> Garbage collection + data tiering are in place (attachment tiering as far as
> S3 is concerned) but I want to have further mechanisms at hand to reduce the
> bill.
> On the gains side:
> - Attachment payload is mostly already compressed - this is mitigated by
> attachment tiering
> - MIME is base64 encoded - we can expect a minimum compression ratio of 30%
> - On the LINAGORA workload I measured a compression ratio of ~0.55
> h3. What?
> Optional ZSTD compression in James ObjectStores.
> I wish for a fully retro-compatible mechanism that only decompresses
> compressed data, ideally using `content-encoding` metadata on the S3 object
> as a compression marker.
> I also wish for:
> - A size threshold (16KB by default)
> - A minimum compression ratio (defaulting to 1 - only compress if meaningful)
> In `blob.properties`:
> {code:java}
> compression.enabled=true
> compression.size.threshold=16K
> compression.min.ratio=0.8
> {code}
> I propose to implement this directly in the S3BlobStoreDAO. Sadly, the
> BlobStoreDAO abstraction lacks the metadata abstraction needed to know
> whether the data has been compressed or not.
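The size-threshold and minimum-ratio gates could be sketched as below. Note this is an illustration, not the actual James implementation: `java.util.zip.Deflater` stands in for ZSTD (zstd-jni being a third-party dependency), and the class and constant names are made up, mirroring the proposed `blob.properties` keys:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class CompressionGate {
    // Illustrative defaults mirroring the proposed blob.properties keys
    static final int SIZE_THRESHOLD = 16 * 1024; // compression.size.threshold=16K
    static final double MIN_RATIO = 0.8;         // compression.min.ratio=0.8

    /** Returns the compressed payload, or the original bytes when compression is not worth it. */
    static byte[] maybeCompress(byte[] payload) {
        if (payload.length < SIZE_THRESHOLD) {
            return payload; // small blobs: overhead outweighs gains
        }
        byte[] compressed = deflate(payload);
        double ratio = (double) compressed.length / payload.length;
        return ratio <= MIN_RATIO ? compressed : payload; // keep the original if the gain is marginal
    }

    // Deflate stands in for ZSTD here; James would plug in a ZSTD codec instead.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```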
> h3. Bringing the design even further
> What I will actually *really* do is:
> - never compress in James
> - leverage external Java code to compress data of older generations (> 1
> month)
> - and have James serve the older compressed data transparently
> With this:
> - Most of the data is stored compressed - massive storage gains.
> - Data compression is fully asynchronous
> - 90% minimum of the read traffic is served uncompressed
> h3. Alternatives
> The most controversial part of the proposal is not doing this by composing
> the BlobStoreDAO - I actually miss a `metadata` concept to do this.
> We *could* bring this metadata (Map<String, String>) to the BlobStoreDAO
> I would propose something like this for the data model:
> {code:java}
> sealed interface Blob {
>     Map<String, String> metadata();
>     // Have the POJOs encode some conversions?
>     InputStreamBlob payloadAsStream();
>     BytesBlob asBytes();
>     ByteSourceBlob asByteSource();
> }
> record BytesBlob(...) implements Blob {...}
> record InputStreamBlob(...) implements Blob {...}
> record ByteSourceBlob(...) implements Blob {...}
> record StringBlob(...) implements Blob {...}
> record ByteFluxBlob(...) implements Blob {...} // Flux<ByteBuffer>
> {code}
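A compilable miniature of that sealed hierarchy, reduced to two payload shapes so implementations can branch on the concrete type; the record components are illustrative, not the proposed API:

```java
import java.io.InputStream;
import java.util.Map;

public class BlobModelSketch {
    // Miniature of the proposed sealed hierarchy; the real POJOs would
    // also carry the conversion methods sketched in the issue.
    sealed interface Blob {
        Map<String, String> metadata();
    }
    record BytesBlob(byte[] bytes, Map<String, String> metadata) implements Blob {}
    record InputStreamBlob(InputStream stream, long size, Map<String, String> metadata) implements Blob {}

    // A DAO implementation can branch on the concrete payload type:
    static String describe(Blob blob) {
        if (blob instanceof BytesBlob b) {
            return "bytes:" + b.bytes().length;
        } else if (blob instanceof InputStreamBlob s) {
            return "stream:" + s.size();
        }
        throw new IllegalStateException("unreachable: the hierarchy is sealed");
    }
}
```

With Java 21 the if/else chain collapses into an exhaustive `switch` over the sealed type, with no default branch needed.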
> We could then refine the BlobStoreDAO interface:
> {code:java}
> public interface BlobStoreDAO {
>     // implementations to pattern match on data!
>     Publisher<Void> save(BucketName bucketName, BlobId blobId, Blob data);
>     Publisher<BytesBlob> readBytes(BucketName bucketName, BlobId blobId);
>     Publisher<InputStreamBlob> readReactive(BucketName bucketName, BlobId blobId);
>     InputStreamBlob read(BucketName bucketName, BlobId blobId);
>     // delete* + list* methods unchanged
> }
> {code}
> Please note that implementations that do not support metadata (file,
> Cassandra) shall THROW.
> Upsides:
> - Independent from S3: we do not make the S3 code more complex, we compose
> over it
> - Independent from S3: we could actually reuse this for other blob stores
> (if any)
> - Benefit for encryption: to benefit from compression, we must compress then
> encrypt; encrypt then compress yields zero benefit. Making compression an S3
> concern would force us into encrypt then compress.
> Downside: major refactoring needed...
> Opinions?
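To make the composition idea concrete, here is a self-contained sketch of the retro-compatible read path over such a metadata-aware store. Everything here is hypothetical: the store is reduced to an in-memory map instead of the `BlobStoreDAO` above, and Deflate/Inflate stand in for ZSTD:

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/**
 * Sketch: mark compressed blobs with a content-encoding metadata entry and
 * transparently inflate on read, serving legacy (unmarked) blobs untouched.
 */
public class CompressingStoreSketch {
    record Stored(byte[] bytes, Map<String, String> metadata) {}

    private final Map<String, Stored> store = new HashMap<>();

    public void save(String blobId, byte[] payload, boolean compress) {
        if (compress) {
            store.put(blobId, new Stored(deflate(payload), Map.of("content-encoding", "deflate")));
        } else {
            // Legacy or incompressible blobs: stored verbatim, no marker.
            store.put(blobId, new Stored(payload, Map.of()));
        }
    }

    public byte[] read(String blobId) throws DataFormatException {
        Stored stored = store.get(blobId);
        // Only blobs *we* marked get inflated: no header sniffing, no false positives.
        if ("deflate".equals(stored.metadata().get("content-encoding"))) {
            return inflate(stored.bytes());
        }
        return stored.bytes();
    }

    private static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    private static byte[] inflate(byte[] input) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!inflater.finished()) {
            out.write(buffer, 0, inflater.inflate(buffer));
        }
        inflater.end();
        return out.toByteArray();
    }
}
```

With real S3, the marker would live in the object's `Content-Encoding` header rather than an in-memory map, which is what makes the migration transparent: old objects simply lack the marker.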
--
This message was sent by Atlassian Jira
(v8.20.10#820010)