Benoit Tellier created JAMES-4182:
-------------------------------------

             Summary: S3 object compression
                 Key: JAMES-4182
                 URL: https://issues.apache.org/jira/browse/JAMES-4182
             Project: James Server
          Issue Type: Improvement
          Components: s3
            Reporter: Benoit Tellier


h3. Why?

As a James operator I want to save money on my S3 cloud bill.

Operating a mail solution, I notice that 50% of the cloud cost is S3 storage.

Garbage collection + data tiering are in place (attachment tiering as far as S3 
is concerned) but I want further mechanisms at hand to reduce the bill.

On the gains side:
 - Attachment payloads are mostly already compressed - this is mitigated by 
attachment tiering
 - MIME is base64 encoded - we can expect a minimum compression ratio of 30%
 - On the LINAGORA workload I measured a compression ratio of ~0.55
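The base64 figure above can be illustrated with a small experiment. Note this is a sketch, not James code: `Deflater` stands in for ZSTD, and the demo shows that even on an incompressible payload, base64 encoding alone leaves roughly 2 of every 8 bits recoverable (~25% reduction); ZSTD typically does at least as well, in line with the ~30% floor quoted here.

```java
import java.io.ByteArrayOutputStream;
import java.util.Base64;
import java.util.Random;
import java.util.zip.Deflater;

// Illustration of why base64-encoded MIME compresses well: base64 spends
// 8 bits per character on a 64-symbol alphabet, so even plain DEFLATE
// (standing in here for ZSTD) reclaims roughly 2 bits per character on an
// otherwise incompressible payload - before any gains from redundancy in
// the underlying content.
public class Base64CompressionDemo {
    public static double compressionRatio(byte[] raw) {
        byte[] base64 = Base64.getEncoder().encode(raw);
        Deflater deflater = new Deflater();
        deflater.setInput(base64);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return (double) out.size() / base64.length;
    }

    public static void main(String[] args) {
        byte[] randomPayload = new byte[64 * 1024];
        new Random(42).nextBytes(randomPayload); // worst case: random content
        System.out.println(compressionRatio(randomPayload));
    }
}
```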

h3. What?

Optional ZSTD compression in James ObjectStores.

I wish for a fully retro-compatible mechanism that only decompresses data that 
is actually compressed, ideally using `content-encoding` metadata on the S3 
object as a compression marker.

I also wish for:
 - A size threshold (16KB by default)
 - A minimum compression ratio (defaulting to 1 - only compress if meaningful)
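The retro-compatibility rule on the read path could look like the following sketch. This is illustrative only: `"zstd"` is an assumed marker value, `Inflater` stands in for a ZSTD decoder, and `TransparentReader` is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Read-side sketch of the retro-compatibility rule: only objects whose
// content-encoding metadata carries the compression marker get decompressed;
// everything else (legacy, uncompressed objects) is returned as-is.
public class TransparentReader {
    static final String COMPRESSION_MARKER = "zstd"; // assumed marker value

    public static byte[] read(byte[] stored, String contentEncoding) throws DataFormatException {
        if (!COMPRESSION_MARKER.equals(contentEncoding)) {
            return stored; // legacy uncompressed object: serve untouched
        }
        Inflater inflater = new Inflater();
        inflater.setInput(stored);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!inflater.finished()) {
            out.write(buffer, 0, inflater.inflate(buffer));
        }
        inflater.end();
        return out.toByteArray();
    }
}
```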

In `blob.properties`:

{code:java}
compression.enabled=true
compression.size.threshold=16K
compression.min.ratio=0.8
{code}
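The two settings combine into a write-side policy along these lines. A minimal sketch, assuming `Deflater` as a stand-in for ZSTD and an illustrative `CompressionPolicy` class name:

```java
import java.io.ByteArrayOutputStream;
import java.util.Optional;
import java.util.zip.Deflater;

// Sketch of the proposed policy: compress only when the payload exceeds
// compression.size.threshold AND the achieved ratio beats
// compression.min.ratio. Deflater stands in for ZSTD here.
public class CompressionPolicy {
    private final int sizeThreshold;   // compression.size.threshold, e.g. 16 * 1024
    private final double minRatio;     // compression.min.ratio, e.g. 0.8

    public CompressionPolicy(int sizeThreshold, double minRatio) {
        this.sizeThreshold = sizeThreshold;
        this.minRatio = minRatio;
    }

    // Returns the compressed payload if worthwhile, empty otherwise.
    public Optional<byte[]> tryCompress(byte[] payload) {
        if (payload.length < sizeThreshold) {
            return Optional.empty(); // too small to be worth the CPU cost
        }
        byte[] compressed = deflate(payload);
        double ratio = (double) compressed.length / payload.length;
        return ratio <= minRatio ? Optional.of(compressed) : Optional.empty();
    }

    private static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

When `tryCompress` returns empty, the caller stores the payload uncompressed and omits the `content-encoding` marker, so reads stay retro-compatible.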

I propose to implement this directly in the S3BlobStoreDAO. Sadly, the 
BlobStoreDAO abstraction lacks the metadata concept needed to know whether the 
data has been compressed or not.

h3. Bringing the design even further

What I will actually *really* do is:
 - never compress in James
 - leverage external Java code to compress data of older generations (> 1 
month)
 - have James transparently serve older, compressed data

With this:
 - Most of the data is stored compressed - massive storage gains.
 - Data compression is fully asynchronous.
 - At least 90% of the read traffic is served uncompressed.
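The generation rule driving the external compactor reduces to an age gate. A minimal sketch, assuming a configurable minimum age (the one-month figure above) and illustrative names:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the age gate used by the hypothetical external compactor:
// only blobs older than the configured generation age are compressed
// asynchronously, so the hot (recent) read path never pays a
// decompression cost.
public class CompactionEligibility {
    private final Duration minAge; // e.g. Duration.ofDays(30) for "> 1 month"

    public CompactionEligibility(Duration minAge) {
        this.minAge = minAge;
    }

    public boolean shouldCompress(Instant blobCreation, Instant now) {
        return Duration.between(blobCreation, now).compareTo(minAge) >= 0;
    }
}
```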

h3. Alternatives

The most controversial part of the proposal is not doing this by composing the 
BlobStoreDAO - I actually miss a `metadata` concept to do so.

We *could* bring this metadata (Map<String, String>) to the BlobStoreDAO.

I would propose something like this for the data model:

{code:java}
sealed interface Blob
    permits BytesBlob, InputStreamBlob, ByteSourceBlob, StringBlob, ByteFluxBlob {
    Map<String, String> metadata();

    // Have the POJOs encode some conversions?
    InputStreamBlob payloadAsStream();
    BytesBlob asBytes();
    ByteSourceBlob asByteSource();
}
record BytesBlob(...) implements Blob {...}
record InputStreamBlob(...) implements Blob {...}
record ByteSourceBlob(...) implements Blob {...}
record StringBlob(...) implements Blob {...}
record ByteFluxBlob(...) implements Blob {...} // Flux<ByteBuffer>
{code}

We could then refine the BlobStoreDAO interface:

{code:java}
public interface BlobStoreDAO {
    // implementations to pattern match on data!
    Publisher<Void> save(BucketName bucketName, BlobId blobId, Blob data);

    Publisher<BytesBlob> readBytes(BucketName bucketName, BlobId blobId);
    Publisher<InputStreamBlob> readReactive(BucketName bucketName, BlobId blobId);
    InputStreamBlob read(BucketName bucketName, BlobId blobId);

    // delete* + list* methods unchanged
}
{code}

Please note that implementations that do not support metadata (file, Cassandra) 
shall THROW.
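The composition idea can be sketched synchronously with simplified stand-ins. Everything below is illustrative: `SimpleBlobStore` is a toy stand-in for the metadata-aware BlobStoreDAO, `"zstd"` is an assumed marker value, and `Deflater`/`Inflater` stand in for ZSTD.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Toy stand-in for a metadata-aware BlobStoreDAO (synchronous for brevity).
interface SimpleBlobStore {
    void save(String blobId, byte[] payload, Map<String, String> metadata);
    byte[] read(String blobId);
    Map<String, String> metadata(String blobId);
}

// In-memory delegate used to demonstrate the composition.
class MemoryBlobStore implements SimpleBlobStore {
    private final Map<String, byte[]> payloads = new HashMap<>();
    private final Map<String, Map<String, String>> metadataStore = new HashMap<>();

    public void save(String blobId, byte[] payload, Map<String, String> metadata) {
        payloads.put(blobId, payload);
        metadataStore.put(blobId, metadata);
    }

    public byte[] read(String blobId) { return payloads.get(blobId); }
    public Map<String, String> metadata(String blobId) { return metadataStore.get(blobId); }
}

// Decorator: records a content-encoding marker in per-blob metadata on save,
// and only decompresses blobs that carry the marker on read.
class CompressingBlobStore implements SimpleBlobStore {
    static final String CONTENT_ENCODING = "content-encoding";
    static final String MARKER = "zstd";
    private final SimpleBlobStore delegate;

    CompressingBlobStore(SimpleBlobStore delegate) { this.delegate = delegate; }

    public void save(String blobId, byte[] payload, Map<String, String> metadata) {
        Map<String, String> enriched = new HashMap<>(metadata);
        enriched.put(CONTENT_ENCODING, MARKER);
        delegate.save(blobId, deflate(payload), enriched);
    }

    public byte[] read(String blobId) {
        byte[] stored = delegate.read(blobId);
        if (MARKER.equals(delegate.metadata(blobId).get(CONTENT_ENCODING))) {
            return inflate(stored);
        }
        return stored; // legacy uncompressed blob: served untouched
    }

    public Map<String, String> metadata(String blobId) { return delegate.metadata(blobId); }

    private static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    private static byte[] inflate(byte[] input) {
        try {
            Inflater inflater = new Inflater();
            inflater.setInput(input);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                out.write(buffer, 0, inflater.inflate(buffer));
            }
            inflater.end();
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException("Corrupt compressed blob", e);
        }
    }
}
```

The point of the sketch: the delegate (S3, or anything else) never learns about compression, and blobs written before the decorator existed are still readable.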

Upsides:
 - Independent from S3: we do not make the S3 code more complex, we compose 
over it.
 - Independent from S3: we could actually reuse this for other blob stores (if 
any).
 - Benefits for encryption: to benefit from compression, we need to compress 
then encrypt; encrypting then compressing yields zero benefit. By making 
compression an S3 concern we would be forced to encrypt then compress.

Downside: major refactoring needed...

Opinions?




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
