This is an automated email from the ASF dual-hosted git repository. rcordier pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/james-project.git
commit 31bbd652c1550f6a6bfad725709d05dff285ad26
Author: Benoit Tellier <[email protected]>
AuthorDate: Thu Oct 17 09:28:18 2019 +0700

    JAMES-2921 ADR for ObjectStorage usage improvements
---
 src/adr/0014-blobstore-storage-policies.md | 57 +++++++++++++++++++++++++++++
 src/adr/0015-objectstorage-blobid-list.md  | 59 ++++++++++++++++++++++++++++++
 2 files changed, 116 insertions(+)

diff --git a/src/adr/0014-blobstore-storage-policies.md b/src/adr/0014-blobstore-storage-policies.md
new file mode 100644
index 0000000..655b1eb
--- /dev/null
+++ b/src/adr/0014-blobstore-storage-policies.md
@@ -0,0 +1,57 @@
+# 14. Add storage policies for BlobStore
+
+Date: 2019-10-09
+
+## Status
+
+Proposed
+
+Adoption needs to be backed by some performance tests, as well as by an evaluation of how the data distribution between Cassandra and object storage shifts.
+
+## Context
+
+James exposes a simple BlobStore API for storing raw data. However, such raw data often varies in size and access patterns.
+
+As an example:
+
+ - Mailbox message headers are expected to be small and frequently accessed
+ - Mailbox message bodies are expected to have sizes ranging from small to big but are infrequently accessed
+ - DeletedMessageVault message headers are expected to be small and infrequently accessed
+
+Also, the various implementations of BlobStore have different strengths:
+
+ - CassandraBlobStore is efficient for small blobs and offers low latency. However, it is known to be expensive for big blobs. Cassandra storage is expensive.
+ - Object Storage blob store is good at storing big blobs, but it induces higher latencies than Cassandra for small blobs for a cost gain that isn't worth it.
+
+Thus, significant performance and cost ratio refinement could be unlocked by using the right blob store for the right blob.
+
+## Decision
+
+Introduce StoragePolicies at the level of the BlobStore API.
+
+The proposed policies include:
+
+ - SizeBasedStoragePolicy: The blob's underlying storage medium will be chosen depending on its size.
+ - LowCostStoragePolicy: The blob is expected to be saved in low cost storage. Access is expected to be infrequent.
+ - PerformantStoragePolicy: The blob is expected to be saved in performant storage. Access is expected to be frequent.
+
+A HybridBlobStore will replace the current UnionBlobStore and will allow choosing between Cassandra and ObjectStorage implementations depending on the policies.
+
+DeletedMessageVault, BlobExport & MailRepository will rely on LowCostStoragePolicy. Other BlobStore users will rely on SizeBasedStoragePolicy.
+
+Some performance tests will be run in order to evaluate the improvements.
+
+## Consequences
+
+We expect small frequently accessed blobs to be located in Cassandra, allowing ObjectStorage to be used mainly for large costly blobs.
+
+In case of a less than 5% improvement, the code will not be added to the codebase and the proposal will get the status 'rejected'.
+
+We expect more data to be stored in Cassandra. We need to quantify this for adoption.
+
+As reads will query both blobStores, no migration is required to use this composite blobstore on top of an existing implementation,
+however we will benefit from the performance enhancements only for newly stored blobs.
+
+## References
+
+ - [JIRA](https://issues.apache.org/jira/browse/JAMES-2921)
\ No newline at end of file
diff --git a/src/adr/0015-objectstorage-blobid-list.md b/src/adr/0015-objectstorage-blobid-list.md
new file mode 100644
index 0000000..ef1523b
--- /dev/null
+++ b/src/adr/0015-objectstorage-blobid-list.md
@@ -0,0 +1,59 @@
+# 15. Persist BlobIds to avoid persisting the same blobs several times within ObjectStorage
+
+Date: 2019-10-09
+
+## Status
+
+Proposed
+
+Adoption needs to be backed by some performance tests.
+
+## Context
+
+A given mail is often written to the blob store by different components.
And mail traffic is heavily duplicated (several recipients receiving similar emails, same attachments). This causes a given blob to often be persisted several times.
+
+Cassandra was the first implementation of the blobStore. Cassandra is a heavily write-optimized NoSQL database. One can assume writes to be fast on top of Cassandra. Thus we assumed we could always overwrite blobs.
+
+This usage pattern was also adopted for BlobStore on top of ObjectStorage.
+
+However, writing to object storage:
+ - Takes time
+ - Is billed by most cloud providers
+
+Thus choosing the right strategy to avoid writing a blob twice is desirable.
+
+However, the ObjectStorage (OpenStack Swift) `exist` method was not efficient enough to be a real cost and performance saver.
+
+## Decision
+
+Rely on a StoredBlobIdsList API to know which blob is persisted or not in object storage. Provide a Cassandra implementation of it.
+Located in blob-api for convenience, this is not a top level API. It is intended to be used by some blobStore implementations
+(here only ObjectStorage). We will provide a CassandraStoredBlobIdsList in blob-cassandra project so that guice products combining
+object storage and Cassandra can define a binding to it.
+
+ - When saving a blob with a precomputed blobId, we can check the existence of the blob in storage, possibly avoiding the expensive "save".
+ - When saving a blob too big to precompute its blobId, once the blob has been streamed using a temporary random blobId, the copy operation can be avoided and the temporary blob can be directly removed.
+
+Cassandra is probably faster doing "write every time" rather than "read before write", so we should not use the stored blob projection for it.
+
+Some performance tests will be run in order to evaluate the improvements.
+
+## Consequences
+
+We expect to reduce the amount of writes to the object storage.
This is expected to improve:
+ - operational costs on cloud providers
+ - performance
+ - latency under load
+
+As id persistence in StoredBlobIdsList will be done once the blob is successfully saved, inconsistencies in StoredBlobIdsList
+will lead to duplicated saved blobs, which is the current behaviour.
+
+In case of a less than 5% improvement, the code will not be added to the codebase and the proposal will get the status 'rejected'.
+
+## Reference
+
+Previous optimization proposal using blob existence checks before persist. This work was done using the ObjectStorage exist method and was proven not efficient enough.
+
+https://github.com/linagora/james-project/pull/2011 (V2)
+
+ - [JIRA](https://issues.apache.org/jira/browse/JAMES-2921)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
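
The "read before write" idea from the Decision section of ADR 15 can be illustrated with a small, language-neutral sketch. This is not the James implementation: apart from the `StoredBlobIdsList` concept named in the ADR, every class and method name below is hypothetical, the in-memory classes merely stand in for Cassandra and object storage, and the use of SHA-256 to precompute the blobId is an assumption of this sketch.

```python
import hashlib


class InMemoryStoredBlobIdsList:
    """Hypothetical stand-in for a Cassandra-backed StoredBlobIdsList."""

    def __init__(self):
        self._ids = set()

    def is_stored(self, blob_id):
        return blob_id in self._ids

    def store(self, blob_id):
        self._ids.add(blob_id)


class CountingObjectStorage:
    """Hypothetical object storage that counts (billed) write operations."""

    def __init__(self):
        self.blobs = {}
        self.writes = 0

    def put(self, blob_id, data):
        self.blobs[blob_id] = data
        self.writes += 1


class DeduplicatingBlobStore:
    """Consults the stored-id list before writing to object storage."""

    def __init__(self, storage, stored_ids):
        self._storage = storage
        self._stored_ids = stored_ids

    def save(self, data):
        # Precompute a content-addressed blobId (SHA-256 is an assumption).
        blob_id = hashlib.sha256(data).hexdigest()
        if self._stored_ids.is_stored(blob_id):
            # Already persisted: skip the expensive object storage write.
            return blob_id
        self._storage.put(blob_id, data)
        # Record the id only after the save succeeded, so a missing entry
        # can only cause a duplicated write, never a lost blob.
        self._stored_ids.store(blob_id)
        return blob_id


storage = CountingObjectStorage()
store = DeduplicatingBlobStore(storage, InMemoryStoredBlobIdsList())
first = store.save(b"same attachment")
second = store.save(b"same attachment")
```

In this sketch the second `save` of identical content returns the same id without touching object storage. Recording the id only after a successful write mirrors the ADR's remark that an inconsistent list falls back to the current behaviour of duplicated saves.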
