Benoit Tellier created JAMES-4182:
-------------------------------------
Summary: S3 object compression
Key: JAMES-4182
URL: https://issues.apache.org/jira/browse/JAMES-4182
Project: James Server
Issue Type: Improvement
Components: s3
Reporter: Benoit Tellier
h3. Why?
As a James operator I want to save money on my S3 cloud bill.
Operating a mail solution, I notice that 50% of the cloud cost is S3 storage.
Garbage collection + data tiering are in place (attachment tiering as far as S3
is concerned) but I want further mechanisms at hand to reduce the bill.
On the gains side:
- Attachment payloads are mostly already compressed - this is mitigated by
attachment tiering
- MIME is base64 encoded - we can expect a minimum compression ratio of 30%
- On a LINAGORA workload I measured a compression ratio of ~0.55
h3. What?
Optional ZSTD compression in James ObjectStores.
I wish for a fully backward-compatible mechanism that only decompresses
compressed data, ideally using `content-encoding` metadata on the S3 object as
a compression marker.
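The read path implied by such a marker can be sketched as follows. This is a hypothetical illustration (class and method names are mine, not James code), and GZIP from `java.util.zip` stands in for ZSTD so the sketch runs on the JDK alone; real code would use a ZSTD stream:

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TransparentRead {
    static final String CONTENT_ENCODING = "content-encoding";
    static final String MARKER = "zstd"; // value stamped on compressed objects

    // Decompress only when the object's metadata carries the compression
    // marker; plain objects pass through untouched, which keeps the
    // mechanism fully backward compatible.
    public static InputStream maybeDecompress(Map<String, String> metadata, InputStream payload) throws IOException {
        if (MARKER.equals(metadata.get(CONTENT_ENCODING))) {
            return new GZIPInputStream(payload); // real code: ZSTD stream
        }
        return payload; // legacy, uncompressed object: serve as-is
    }

    // Helper for the demo: compress some bytes (GZIP standing in for ZSTD).
    public static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "MIME-Version: 1.0\r\nSubject: test\r\n".getBytes();
        byte[] viaMarker = maybeDecompress(Map.of(CONTENT_ENCODING, MARKER),
                new ByteArrayInputStream(compress(original))).readAllBytes();
        byte[] plain = maybeDecompress(Map.of(),
                new ByteArrayInputStream(original)).readAllBytes();
        System.out.println(java.util.Arrays.equals(viaMarker, plain)); // same bytes either way
    }
}
{code}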
I also wish for:
- A size threshold (16KB by default)
- A minimum compression ratio (defaulting to 1 - only compress when meaningful)
In `blob.properties`:
{code:java}
compression.enabled=true
compression.size.threshold=16K
compression.min.ratio=0.8
{code}
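The write-path decision driven by these two properties could look like the following. A hypothetical sketch (the policy class and its field names are illustrative, not existing James code):

{code:java}
public class CompressionPolicy {
    final boolean enabled;          // compression.enabled
    final long sizeThresholdBytes;  // compression.size.threshold
    final double minRatio;          // compression.min.ratio

    CompressionPolicy(boolean enabled, long sizeThresholdBytes, double minRatio) {
        this.enabled = enabled;
        this.sizeThresholdBytes = sizeThresholdBytes;
        this.minRatio = minRatio;
    }

    // Small payloads are not worth the CPU round trip.
    boolean worthTrying(long payloadSize) {
        return enabled && payloadSize >= sizeThresholdBytes;
    }

    // After a trial compression, keep the compressed form only when the
    // ratio (compressed / original) beats the configured minimum.
    boolean keepCompressed(long originalSize, long compressedSize) {
        return (double) compressedSize / originalSize <= minRatio;
    }

    public static void main(String[] args) {
        CompressionPolicy policy = new CompressionPolicy(true, 16 * 1024, 0.8);
        System.out.println(policy.worthTrying(4 * 1024));           // false: below 16K
        System.out.println(policy.worthTrying(64 * 1024));          // true
        System.out.println(policy.keepCompressed(100_000, 55_000)); // true: ratio 0.55
        System.out.println(policy.keepCompressed(100_000, 95_000)); // false: 0.95 > 0.8
    }
}
{code}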
I propose to implement this directly in the S3BlobStoreDAO. Sadly, the
BlobStoreDAO abstraction lacks the metadata needed to know whether the data
has been compressed or not.
h3. Bringing the design even further
What I will actually *really* do is:
- never compress in James
- leverage external Java code to compress data of older generations (> 1
month)
- and have James serve the older, compressed data transparently
With this:
- Most of the data is stored compressed - massive storage gains.
- Data compression is fully asynchronous
- At least 90% of the read traffic is served uncompressed
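The asynchronous, age-based rewrite described above can be sketched as below. This is a hypothetical illustration only: a `HashMap` stands in for the S3 bucket, the compression itself is elided, and all names are mine:

{code:java}
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

public class GenerationCompactor {
    record StoredBlob(byte[] payload, Instant written, String contentEncoding) {}

    static final Duration AGE_THRESHOLD = Duration.ofDays(30);

    // Walk blobs of older generations and rewrite the uncompressed ones,
    // stamping the content-encoding marker so James can detect them on read.
    static int compactOlderThan(Map<String, StoredBlob> bucket, Instant now) {
        int rewritten = 0;
        for (Map.Entry<String, StoredBlob> entry : bucket.entrySet()) {
            StoredBlob blob = entry.getValue();
            boolean oldEnough = Duration.between(blob.written(), now).compareTo(AGE_THRESHOLD) >= 0;
            boolean alreadyCompressed = blob.contentEncoding() != null;
            if (oldEnough && !alreadyCompressed) {
                // Real code would ZSTD-compress blob.payload() before rewriting.
                entry.setValue(new StoredBlob(blob.payload(), blob.written(), "zstd"));
                rewritten++;
            }
        }
        return rewritten;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Map<String, StoredBlob> bucket = new HashMap<>();
        bucket.put("old", new StoredBlob(new byte[0], now.minus(Duration.ofDays(60)), null));
        bucket.put("fresh", new StoredBlob(new byte[0], now.minus(Duration.ofDays(2)), null));
        System.out.println(compactOlderThan(bucket, now)); // only "old" is rewritten
    }
}
{code}

Because the job runs outside James, a crash mid-pass is harmless: already-marked blobs are skipped on the next run.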
h3. Alternatives
The most controversial part of the proposal is not doing this by composing the
BlobStoreDAO - I am actually missing a `metadata` concept to do so.
We *could* bring this metadata (Map<String, String>) to the BlobStoreDAO.
I would propose something like this for the data model:
{code:java}
sealed interface Blob {
    Map<String, String> metadata();

    // Have the POJOs encode some conversions?
    InputStreamBlob payloadAsStream();
    BytesBlob asBytes();
    ByteSourceBlob asByteSource();
}

record BytesBlob {...}
record InputStreamBlob {...}
record ByteSourceBlob {...}
record StringBlob {...}
record ByteFluxBlob {...} // Flux<ByteBuffer>
{code}
We could then refine the BlobStoreDAO interface:
{code:java}
public interface BlobStoreDAO {
    // implementations to pattern match on data!
    Publisher<Void> save(BucketName bucketName, BlobId blobId, Blob data);
    Publisher<BytesBlob> readBytes(BucketName bucketName, BlobId blobId);
    Publisher<InputStreamBlob> readReactive(BucketName bucketName, BlobId blobId);
    InputStreamBlob read(BucketName bucketName, BlobId blobId);

    // delete* + list* methods unchanged
}
{code}
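The "implementations pattern match on data" idea could look like this in an implementation. A hypothetical sketch using Java 21 pattern matching for switch; the variant set is trimmed to two records and all names are illustrative:

{code:java}
import java.util.Map;

public class PatternMatchSave {
    sealed interface Blob permits BytesBlob, StringBlob {
        Map<String, String> metadata();
    }
    record BytesBlob(byte[] payload, Map<String, String> metadata) implements Blob {}
    record StringBlob(String payload, Map<String, String> metadata) implements Blob {}

    // Each variant gets its own upload path, and the DAO can read the
    // metadata the current abstraction lacks (e.g. the compression marker).
    static String describeSave(Blob data) {
        return switch (data) {
            case BytesBlob b -> "bytes upload of " + b.payload().length + " bytes, encoding="
                    + b.metadata().getOrDefault("content-encoding", "identity");
            case StringBlob s -> "string upload of " + s.payload().length() + " chars";
        };
    }

    public static void main(String[] args) {
        System.out.println(describeSave(new BytesBlob(new byte[42], Map.of("content-encoding", "zstd"))));
        System.out.println(describeSave(new StringBlob("hello", Map.of())));
    }
}
{code}

Sealing the interface makes the switch exhaustive, so adding a new Blob variant forces every implementation to handle it at compile time.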
Please note that implementations that do not support metadata (file,
Cassandra) shall THROW.
Upsides:
- independent from S3: we do not make the S3 code more complex, we compose
over it
- independent from S3: we could actually reuse this for other blob stores (if
any)
- benefits for encryption: to benefit from compression, we need to compress
then encrypt; encrypting then compressing yields zero benefit, as encrypted
data does not compress. With compression as an S3 concern we would be forced
to encrypt then compress.
Downside: major refactoring needed...
Opinions?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]