[james-project] 02/02: JAMES-2921 ADR for ObjectStorage usage improvements

rcordier Mon, 25 Nov 2019 02:48:08 -0800

This is an automated email from the ASF dual-hosted git repository.

rcordier pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/james-project.git


commit 31bbd652c1550f6a6bfad725709d05dff285ad26
Author: Benoit Tellier <[email protected]>
AuthorDate: Thu Oct 17 09:28:18 2019 +0700

    JAMES-2921 ADR for ObjectStorage usage improvements
---
 src/adr/0014-blobstore-storage-policies.md | 57 +++++++++++++++++++++++++++++
 src/adr/0015-objectstorage-blobid-list.md  | 59 ++++++++++++++++++++++++++++++
 2 files changed, 116 insertions(+)

diff --git a/src/adr/0014-blobstore-storage-policies.md b/src/adr/0014-blobstore-storage-policies.md
new file mode 100644
index 0000000..655b1eb
--- /dev/null
+++ b/src/adr/0014-blobstore-storage-policies.md
@@ -0,0 +1,57 @@
+# 14. Add storage policies for BlobStore
+
+Date: 2019-10-09
+
+## Status
+
+Proposed
+
+Adoption needs to be backed by performance tests, as well as by an evaluation of how the data distribution shifts between Cassandra and object storage.
+
+## Context
+
+James exposes a simple BlobStore API for storing raw data. However, such raw data often varies in size and access patterns.
+
+As an example:
+
+ - Mailbox message headers are expected to be small and frequently accessed
+ - Mailbox message bodies are expected to range in size from small to big but are infrequently accessed
+ - DeletedMessageVault message headers are expected to be small and infrequently accessed
+
+Also, the various implementations of BlobStore have different strengths:
+
+ - CassandraBlobStore is efficient for small blobs and offers low latency. However, it is known to be expensive for big blobs. Cassandra storage is expensive.
+ - Object Storage blob store is good at storing big blobs, but for small blobs it induces higher latencies than Cassandra, for a cost gain that isn't worth it.
+
+Thus, a significantly better performance-to-cost ratio could be unlocked by using the right blob store for the right blob.
+
+## Decision
+
+Introduce StoragePolicies at the level of the BlobStore API.
+
+The proposed policies include:
+
+ - SizeBasedStoragePolicy: The blob's underlying storage medium will be chosen depending on its size.
+ - LowCostStoragePolicy: The blob is expected to be saved in low cost storage. Access is expected to be infrequent.
+ - PerformantStoragePolicy: The blob is expected to be saved in performant storage. Access is expected to be frequent.
+
+A HybridBlobStore will replace the current UnionBlobStore and will choose between the Cassandra and ObjectStorage implementations depending on the policies.
+
+DeletedMessageVault, BlobExport & MailRepository will rely on LowCostStoragePolicy. Other BlobStore users will rely on SizeBasedStoragePolicy.
+
+Some performance tests will be run in order to evaluate the improvements.
+
+## Consequences
+
+We expect small, frequently accessed blobs to be located in Cassandra, allowing ObjectStorage to be used mainly for large, costly blobs.
+
+If the improvement is less than 5%, the code will not be added to the codebase and the proposal will get the status 'rejected'.
+
+We expect more data to be stored in Cassandra. We need to quantify this for adoption.
+
+As reads will query both blobStores, no migration is required to use this composite blobstore on top of an existing implementation;
+however, we will benefit from the performance enhancements only for newly stored blobs.
+
+## References
+
+ - [JIRA](https://issues.apache.org/jira/browse/JAMES-2921)
\ No newline at end of file
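
The backend choice described in ADR 14 above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the commit: the three policy names come from the ADR, while `Backend`, `BackendSelector`, the size threshold, and the mapping of LowCost to object storage and Performant to Cassandra are hypothetical and merely inferred from the ADR's context section.

```java
import java.util.Objects;

// The three policies named in the ADR (the enum itself is hypothetical).
enum StoragePolicy {
    SIZE_BASED,
    LOW_COST,
    PERFORMANT
}

// Hypothetical label for the two underlying blob store implementations.
enum Backend {
    CASSANDRA,
    OBJECT_STORAGE
}

// Hypothetical helper showing how a hybrid blob store could pick a backend.
final class BackendSelector {
    private final long sizeThresholdInBytes;

    BackendSelector(long sizeThresholdInBytes) {
        if (sizeThresholdInBytes < 0) {
            throw new IllegalArgumentException("threshold must not be negative");
        }
        this.sizeThresholdInBytes = sizeThresholdInBytes;
    }

    Backend select(StoragePolicy policy, long blobSizeInBytes) {
        Objects.requireNonNull(policy, "policy");
        switch (policy) {
            case LOW_COST:
                // Assumption: object storage is the low cost medium.
                return Backend.OBJECT_STORAGE;
            case PERFORMANT:
                // Assumption: Cassandra is the low latency medium.
                return Backend.CASSANDRA;
            case SIZE_BASED:
                // Small blobs go to Cassandra, bigger ones to object storage.
                return blobSizeInBytes <= sizeThresholdInBytes
                    ? Backend.CASSANDRA
                    : Backend.OBJECT_STORAGE;
            default:
                throw new IllegalStateException("unknown policy " + policy);
        }
    }
}
```

The threshold would have to come out of the performance tests the ADR asks for; no value is suggested here.
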
diff --git a/src/adr/0015-objectstorage-blobid-list.md b/src/adr/0015-objectstorage-blobid-list.md
new file mode 100644
index 0000000..ef1523b
--- /dev/null
+++ b/src/adr/0015-objectstorage-blobid-list.md
@@ -0,0 +1,59 @@
+# 15. Persist BlobIds to avoid persisting the same blobs several times within ObjectStorage
+
+Date: 2019-10-09
+
+## Status
+
+Proposed
+
+Adoption needs to be backed by some performance tests.
+
+## Context
+
+A given mail is often written to the blob store by different components. Moreover, mail traffic is heavily duplicated (several recipients receiving similar emails, the same attachments). This causes a given blob to often be persisted several times.
+
+Cassandra was the first implementation of the blobStore. Cassandra is a heavily write-optimized NoSQL database. One can assume writes to be fast on top of Cassandra. Thus, we assumed we could always overwrite blobs.
+
+This usage pattern was also adopted for BlobStore on top of ObjectStorage.
+
+However, writing to object storage:
+ - Takes time
+ - Is billed by most cloud providers
+
+Thus, choosing the right strategy to avoid writing a blob twice is desirable.
+
+However, the ObjectStorage (OpenStack Swift) `exist` method was not efficient enough to deliver real cost and performance savings.
+
+## Decision
+
+Rely on a StoredBlobIdsList API to know which blobs are already persisted in object storage. Provide a Cassandra implementation of it.
+Located in blob-api for convenience, this is not a top level API. It is intended to be used by some blobStore implementations
+(here only ObjectStorage). We will provide a CassandraStoredBlobIdsList in the blob-cassandra project so that guice products combining
+object storage and Cassandra can define a binding to it.
+
+ - When saving a blob with a precomputed blobId, we can check whether the blob already exists in storage, possibly avoiding the expensive "save".
+ - When saving a blob too big to precompute its blobId, once the blob has been streamed using a temporary random blobId, the copy operation can be avoided and the temporary blob can be removed directly if the final blobId is already stored.
+
+Cassandra is probably faster doing "write every time" rather than "read before write", so we should not use the stored blob projection for it.
+
+Some performance tests will be run in order to evaluate the improvements.
+
+## Consequences
+
+We expect to reduce the number of writes to the object storage. This is expected to:
+ - reduce operational costs on cloud providers
+ - improve performance
+ - reduce latency under load
+
+As id persistence in StoredBlobIdsList will be done once the blob is successfully saved, inconsistencies in StoredBlobIdsList
+will lead to duplicated saved blobs, which is the current behaviour.
+
+If the improvement is less than 5%, the code will not be added to the codebase and the proposal will get the status 'rejected'.
+
+## References
+
+Previous optimization proposal using blob existence checks before persist. This work was done using the ObjectStorage exist method and proved not efficient enough.
+
+https://github.com/linagora/james-project/pull/2011 (V2)
+
+ - [JIRA](https://issues.apache.org/jira/browse/JAMES-2921)
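
The "check the list before saving" flow described in ADR 15 above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the commit: only the `StoredBlobIdsList` name comes from the ADR. Its methods, the in-memory implementation, `DeduplicatingBlobStore`, and the use of a SHA-256 hash as the precomputed blobId are all hypothetical, and a `HashMap` stands in for the object storage.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Name taken from the ADR; both methods are hypothetical.
interface StoredBlobIdsList {
    boolean isStored(String blobId);

    void markStored(String blobId);
}

// Hypothetical in-memory stand-in for the Cassandra implementation.
final class InMemoryStoredBlobIdsList implements StoredBlobIdsList {
    private final Set<String> ids = new HashSet<>();

    public boolean isStored(String blobId) {
        return ids.contains(blobId);
    }

    public void markStored(String blobId) {
        ids.add(blobId);
    }
}

// Hypothetical blob store; the map stands in for the object storage.
final class DeduplicatingBlobStore {
    private final Map<String, byte[]> objectStorage = new HashMap<>();
    private final StoredBlobIdsList storedBlobIds;
    private int writeCount = 0;

    DeduplicatingBlobStore(StoredBlobIdsList storedBlobIds) {
        this.storedBlobIds = storedBlobIds;
    }

    String save(byte[] data) {
        String blobId = computeBlobId(data);
        if (storedBlobIds.isStored(blobId)) {
            // Already persisted: skip the expensive write.
            return blobId;
        }
        objectStorage.put(blobId, data.clone());
        writeCount++;
        // Record the id only once the blob is saved, so a failure in between
        // leads at worst to a duplicated write.
        storedBlobIds.markStored(blobId);
        return blobId;
    }

    int writeCount() {
        return writeCount;
    }

    // Assumption: the precomputed blobId is a hex-encoded SHA-256 of the content.
    static String computeBlobId(byte[] data) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }
}
```

Saving the same payload twice through this sketch results in a single write to the stand-in storage, which is the saving the ADR hopes to measure.
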

