Re: [VOTE] Proposal to adopt ADR-76 Deleted message vault should use a single bucket

Jean Helou Tue, 20 Jan 2026 13:47:15 -0800

Hello,

An interesting discussion, which helps me better understand the state of S3
storage in James.


> This is the case for
>
>  - mailbox
>  - RabbitMQ mail queue
>  - Cassandra + Postgres mail repositories
>  - Attachments
> They share the default bucket in order to be deduplicated.
> This cross-feature deduplication is explicitly desired, cf.
> https://github.com/apache/james-project/blob/aef61d6fc9655698a8bff1521398ff110388e2d1/server/apps/distributed-app/src/test/java/org/apache/james/WithCassandraDeduplicationBlobStoreTest.java#L113
>


While it is currently implemented this way, I don't think deduplicating
across use cases with different storage expectations is such a good idea.
Mail queues are transient storage (a few days) while mailboxes are long-term
storage (until the heat death of the universe, if my old Gmail is any
indication).
I consider mail repositories transient too, but I may be missing use cases
(all I can think of is spooling and dead letters/errors, neither of which I
would expect to store for more than a few weeks).

Also, there is a 6th use case for the DEFAULT BlobStore bucket: the
mailrepository-blob, which we introduced with Matthieu starting May 2022
and July 2024.

> The current choice is either a full s3 bucket per feature or a single s3
> bucket for all blobstore buckets.
> - no blob.properties prefix, james uses one top level bucket per feature
> - blob.properties prefix : james stores EVERY blob in a single bucket
> To be fairly honest there's nothing harder to migrate than a S3 naming
> layout, I'm reluctant to changes affecting it.
>

The ADR should aim to guide future S3 naming layouts: we can document
legacy exceptions while ensuring our future is brighter.


> Furthermore I believe we would be better served being more declarative and
> have each feature explicitly declare its buckets (and object name prefix?)
>

Its S3 bucket or its BlobStore bucket? (Sorry, I'm really asking, this is
not me being pedantic.)
I'm not sure what the object name prefix you mention is: if there is an
object name prefix, wouldn't that defeat deduplication strategies entirely?
Do you mean introducing an `objectstorage.objectPrefix` that would somehow
interact with `objectstorage.namespace` to allow nesting our BlobStore
buckets under a single S3 bucket?

> Changing the meaning of blob.properties objectstorage.bucketPrefix (
> https://github.com/apache/james-project/blob/aef61d6fc9655698a8bff1521398ff110388e2d1/server/apps/distributed-app/sample-configuration/blob.properties#L66
>  )
> would create even more confusion on the topic and so adding yet another
> "prefix" concept in this file.
>
> I understand operating off 3 buckets might not be ideal but it is likely
> to still be very manageable no?


Indeed I was, once again, misled by the naming of the configuration
parameters and the unexpected hardcoded naming scheme. And it seems
reviewers were confused enough to let the mailrepository-blob use the
default bucket too even though it doesn't make much sense either.

There is the "DEFAULT" BlobStore bucket, which is named "default" in the
code and therefore tries to use a top-level S3 bucket named "default" out of
the box. This mapping can be controlled with the `objectstorage.namespace`
property, which allows binding the "DEFAULT" BlobStore bucket to an
arbitrary S3 bucket name.

There is also an `objectstorage.bucketPrefix` property which has no effect
whatsoever (if I understand correctly) on the mapping of the "DEFAULT"
BlobStore bucket but partially controls the names of the remaining
BlobStore buckets. The `objectstorage.namespace` property has no effect at
all on these buckets ( which still confuses the hell out of me even as I
write this message ).
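
As a sketch of the mapping described above (written in Python for brevity, while James itself is Java): this only encodes the behaviour as I understand it and has not been checked against the James code. `resolve_s3_bucket` and its parameter names are made up for illustration.

```python
from typing import Optional

# Name of the "DEFAULT" BlobStore bucket as hardcoded today.
DEFAULT_BLOBSTORE_BUCKET = "default"


def resolve_s3_bucket(
    blobstore_bucket: str,
    namespace: Optional[str] = None,      # objectstorage.namespace
    bucket_prefix: Optional[str] = None,  # objectstorage.bucketPrefix
) -> str:
    """Return the S3 bucket name a BlobStore bucket would map to."""
    if blobstore_bucket == DEFAULT_BLOBSTORE_BUCKET:
        # Only the namespace influences the DEFAULT bucket; the prefix is ignored.
        return namespace or DEFAULT_BLOBSTORE_BUCKET
    # Every other bucket ignores the namespace and only gets the prefix.
    return (bucket_prefix or "") + blobstore_bucket
```

With `objectstorage.namespace=james-prod` and `objectstorage.bucketPrefix=acme-`, the sketch maps the DEFAULT BlobStore bucket to `james-prod` and `jmap-uploads` to `acme-jmap-uploads`.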

Note that `objectstorage.bucketPrefix` cannot contain a path separator.
This is not checked until the application blows up in your face because the
S3 connector refuses an invalid bucket name, leaving a very confused user
(yes, this happened to me: I couldn't give bucket creation rights to my
James user and had to use pre-existing buckets, so I tried a lot of things).
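
A fail-fast check when the configuration is loaded would avoid this surprise. Here is a minimal sketch in Python (for brevity; James itself is Java), assuming the general S3 bucket naming rules: 3 to 63 characters, lowercase letters, digits, dots and hyphens, starting with a letter or digit. `validate_bucket_prefix` and `longest_suffix` are hypothetical names, not existing James code.

```python
import re

# A prefix must start like a bucket name and only use bucket-name characters.
_PREFIX_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]*")
MAX_BUCKET_NAME_LENGTH = 63


def validate_bucket_prefix(
    prefix: str,
    longest_suffix: str = "deleted-messages-2026-01-01",  # assumed longest bucket name
) -> str:
    """Return the prefix unchanged, or raise ValueError with an explicit message."""
    if "/" in prefix:
        raise ValueError(
            f"bucketPrefix {prefix!r} contains a path separator, "
            "which is not allowed in an S3 bucket name"
        )
    if not _PREFIX_PATTERN.fullmatch(prefix):
        raise ValueError(
            f"bucketPrefix {prefix!r} must start with a lowercase letter or digit "
            "and contain only lowercase letters, digits, dots and hyphens"
        )
    if len(prefix) + len(longest_suffix) > MAX_BUCKET_NAME_LENGTH:
        raise ValueError(f"bucketPrefix {prefix!r} makes bucket names too long")
    return prefix
```

Calling this once at startup turns a late S3 connector error into an immediate configuration error that names the offending property.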

We have:
- `jmap-uploads`, which is necessarily a top-level S3 bucket (it can only
be influenced by `objectstorage.bucketPrefix`). I'm not exactly sure what the
upload-{ww} UploadBucketName is, but it seems specific to the Cassandra
implementation, and its name cannot be configured apart from the
prefix.
- deleted-messages-{YYYY-MM}, which is necessarily a set of dynamic top-level
buckets, possibly with a unique prefix to avoid collisions, but requiring
elevated application permissions in order to create the buckets
when the time marker rolls over. This better explains the current PR.

I'm not familiar with JMAP uploads so I can only guess at what a
proper storage policy would be: the appended week number suggests that the
objects are rather long-lived. In the context of email, the name and the
long-lived objects raise the question of why we wouldn't want them
deduplicated with, let's say, Attachments. That is effectively
impossible today since we can't configure physical bucket storage per
BlobStore bucket use case.

So indeed, today, James has 3 S3 buckets mapped to 3 BlobStore buckets,
including the DEFAULT BlobStore bucket which is used by 6 different
use cases.
I haven't checked whether there is additional namespacing in the form of
virtual folders to try and avoid collisions between the use cases. I know
mailrepository-blob uses the mail repository paths as virtual folders, but
they are created at the top level of the bucket, so when listing my server's
bucket top level I see /var/mail/error, which is not so great in hindsight
(I should probably have prepended a `/mailrepositories` prefix to get
`/mailrepositories/var/mail/error` in order to properly namespace the
feature within the bucket).

I have spent too long with security engineers lately and I can't help
wondering how this could be subverted to gain unauthorized access or to let
an attacker destroy data they shouldn't be able to access.


A path forward could be to:
- Stop binding to BlobStore.DEFAULT_BUCKET_NAME_QUALIFIER all over the
place and let each feature's configuration tell us what the desired
bucket name should be (using "default" as the default value to maintain
backward compatibility). The Guice bindings would then target a
MAILREPOSITORY_BUCKET_NAME_QUALIFIER and a
DELETED_MESSAGE_VAULT_BUCKET_NAME_QUALIFIER and put the actual name there,
ready to be overridden by configuration instead of being hardcoded in the
features.
- Implement optional feature-based namespacing beyond the S3 bucket name,
so each feature can be mixed or isolated through configuration properties.
This retains the existing "everything in a single basket" mashup while
opening the door to better isolation.
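
To illustrate the two points above, here is a rough sketch in Python (for brevity; the real thing would be Guice bindings in Java). The property names `<feature>.bucket` and `<feature>.keyPrefix` are purely hypothetical, as are the function names; the only thing taken from the current behaviour is the "default" fallback.

```python
from typing import Mapping

# Today's hardcoded DEFAULT BlobStore bucket name, kept as the fallback.
DEFAULT_BUCKET = "default"


def bucket_for(feature: str, config: Mapping[str, str]) -> str:
    """BlobStore bucket for a feature; unset features keep using "default"."""
    return config.get(f"{feature}.bucket", DEFAULT_BUCKET)


def object_key(feature: str, blob_id: str, config: Mapping[str, str]) -> str:
    """Object key for a blob; an unset prefix keeps the existing key layout."""
    return config.get(f"{feature}.keyPrefix", "") + blob_id
```

With an empty configuration every feature still lands in "default" with unchanged keys, so existing deployments would not have to migrate anything. Setting, for example, `mailrepository.keyPrefix=mailrepositories/` would give the `mailrepositories/var/mail/error` style of layout mentioned earlier.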

I'm not saying this should all be done in the initial deleted message vault
PR, but it could become the conclusion of the ADR and provide a refactoring
guide. As long as the configuration allows us to keep using the existing S3
naming scheme (both bucket names and paths), the nightmare of migrating an
S3 naming layout would remain safely locked in the closet.

What do you think ?

regards,
Jean


> On Jan 19, 2026 at 12:16 PM, Jean Helou <[email protected]> wrote:
>
> To sum up my take on the discussions so far:
>
> On the specific case of the deleted message vault, the proposed
> implementation :
>
> Introduce a logical bucket (Blobstore) prefix to the time based structure.
> this allows storing all deleted messages in a single S3 bucket or combined
> with the underlying S3 prefix capability having a single toplevel
> "directory" for all these messages. All new messages will be written to the
> new structure, both the new and old addressing schemes will be readable.
> The old addressing scheme will be announced as deprecated but maintained
> for 5 years ( ~ about 3 or 4 major releases ?) ensuring every user who
> updates on a reasonable timescale will have a proper upgrade.
> The name of the logical blobstore bucket will be configurable, enabling
> users who don't use an S3 prefix to use the bucket of their choice for this
> use case.
>
> I agree this is better than the current implementation and vote in favor
> (+1).
> I think we should add comments to the fallback-specific code to flag the
> expected date of removal in the code; I for one guarantee that I will
> forget about it within such a long time span.
>
> With regards to the ADR: as it is currently worded I still vote against
> (-1). Since the vote was initiated on both topics, my compound vote must be
> -1, at least as it is worded today.
> My position is that :
> - blobstore bucket and s3 bucket should not be conflated
> - each feature should use a unique blobstore bucket
> - how blobstore buckets are stored in s3 buckets should be configurable
>
> I argue that some features share similar storage properties/requirement and
> could live in a single s3 bucket with a storage class policy while other
> features require different storage properties/requirements and could live
> in different s3 bucket (s) with their own different storage class policy,
> if we are going to have an ADR about bucket usage we should not ignore
> future features. while the fixed set of 3 buckets may be acceptable today,
> do we really want to force users to configure 10 different buckets in the
> future.
>
> The current choice is either a full s3 bucket per feature or a single s3
> bucket for all blobstore buckets.
> - no blob.properties prefix, james uses one top level buckets per feature
> - blob.properties prefix : james stores EVERY blob in a single bucket
>
> My proposal is to move forward with the implementation of the deleted
> message vault but reboot the ADR discussion.
>
> Jean
>
>
> On Fri, Jan 16, 2026 at 09:14, Rene Cordier <[email protected]> wrote:
>
> > Hello,
> >
> > Overall +1 on my side.
> >
> > Regarding the second topic, to add a few more details: technically the
> > blobstore deleted message vault does not change much in the end. There
> > was no new V2 implementation in the end (we should probably update the
> > ADR regarding this point). We identified that we could go forward in the
> > current vault code to add those modifications:
> >
> > - append is new, but we just write into the new single bucket so it does
> > not collide with the current version. We keep the old code around just
> > for testing (put the tag for it) the fallback, but it's not used anymore
> > outside of tests.
> >
> > - read and delete: the code does not change! As we do not change the
> > underlying design with metadata, it actually works for old buckets and
> > the new single one without code modifications there.
> >
> > - purge task: that's where there is an old code / new code mixed
> > together. I did put a deprecated tag on the concerned old part. If we
> > remove it in +2 releases for example that should be alright? Or even
> > more. Honestly the fallback ain't too bad here. If you don't have old
> > buckets the old purge code part will just do nothing!
> >
> > Hope this helps clarify some points :)
> >
> > Regards,
> >
> > Rene.
> >
> > On 1/15/26 22:59, Benoit TELLIER wrote:
> > > Hello Jean
> > >
> > > Topic 1:
> > >
> > > I think an ADR structure rework discussion was started on the mailing
> > > list already.
> > > Regarding this very example:
> > >   - The principle of operating off a small number of fixed buckets is
> > >     likely an architecture decision, worth recording in an ADR. It could
> > >     decide to define an architecture principle. Then in the *context* we
> > >     could list offending features and plugins, and in *consequences* list
> > >     the needed refactorings.
> > >   - Having an ADR presenting why we have a plugin to fix a generic
> > >     problem in the email world seems like a reasonable choice to me. I
> > >     often end up, in architecture discussions (and as recently as
> > >     yesterday!), discussing this topic. Hence it feels justified to
> > >     keep a track record.
> > >
> > > Also, it seems dangerous to me to say "not relevant, that's a plugin",
> > > as everything (the mailbox, the event-bus, etc.) is somehow a plugin with
> > > Guice bindings, and concerns specific implementations and not others.
> > >
> > > Topic 2: Answers inlined.
> > >
> > > --
> > >
> > > Best regards,
> > >
> > > Benoit TELLIER
> > >
> > > General manager of Linagora VIETNAM.
> > > Product owner for Twake-Mail product.
> > > Chairman of the Apache James project.
> > >
> > > Mail: [email protected]
> > > Tel: (0033) 6 77 26 04 58 (WhatsApp, Signal)
> > >
> > >
> > > On Jan 15, 2026 at 4:13 PM, Jean Helou <[email protected]> wrote:
> > >
> > > Since you called a formal vote I must vote -1. I may change my vote
> > > after the discussion has taken place.
> > >
> > > 2 topics on which I would like to see some discussion before voting:
> > >
> > > Topic 1: Why an ADR for an implementation detail in a plugin?
> > > Maybe I'm just too new to James and/or need a refresher on what James
> > > considers an architectural decision.
> > > It feels to me that this should be part of the README for the plugin. I
> > > see that introducing the feature (the deleted message store) was also
> > > done through an ADR, which seems a bit surprising to me.
> > > I don't disagree with the content of the files themselves (though the
> > > ADR formalism is a bit weird for a README), it just doesn't match my
> > > understanding of what ADRs are.
> > > I personally would prefer to merge the content of both files into a
> > > README file on the plugin itself (to document the why and the
> > > implementation details of the plugin) and list the plugin somewhere in
> > > the documentation (it probably already is) to let people know the
> > > feature exists and where to find details about it.
> > >
> > > On the plugin evolution itself
> > > The current behaviour is to create a full s3 bucket following the
> > > pattern `deleted-messages-[year]-[month]-01` (it is unclear to me what
> > > the 01 represents, is it the day ?)
> > >
> > > Means first day of the month and was adopted to get something that
> > > looks like a date, if I recall well.
> > >
> > >
> > >
> > >
> > > According to the change proposed in the "ADR", the plugin would now
> > > store the same contents under a virtual path within a single bucket
> > > `[year]/[month]/[blob_id]`
> > > The proposed migration strategy is :
> > > - write only on the single bucket, fall back if necessary on old
> > >   buckets for read and delete
> > > - add the single bucket usage case to the purge task, that would do
> > >   cleaning on both new and old buckets.
> > > I think we should work out right now for how long we intend to maintain
> > > the fallback and old behaviour in the clean task
> > >
> > >
> > > +1 typical retention is 1 year so going for 2 years seems reasonable.
> > > Likely a bit more to further ease the upgrade path.
> > >
> > > Probably also clearly communicate (changelog comes to mind) on the
> > > deprecation of the old behaviour so we can eventually remove both the
> > > fallback and the cleaning task.
> > >
> > > +1
> > >
> > > However this has consequences on users' upgrade path. As proposed they
> > > will have to install a version which has both the new behaviour and the
> > > fallback for at least the retention period of all the deleted messages
> > > in their system.
> > >
> > > Would you consider providing a migration tool that allows them to move
> > > their deleted messages to the new scheme and fast forward on the
> > > versions ? (maybe even skip, as each version embarks all the migrations
> > > from the previous versions IIRC)
> > >
> > >
> > > To be fair, given its simplicity I'd rather support the fallback for 5
> > > years than move blobs around. Personal taste.
> > >
> > >
> > > I was away from home and had a small car crash so I didn't have time to
> > > look into 2902 yet. I had a quick look while writing this message and I
> > > was surprised to see an API change that affects the Cassandra
> > > implementation and introduces something similar to a blobid factory
> > > (and used as such) but with a different type. I left a comment to that
> > > effect and will continue the review (probably tomorrow or during the
> > > weekend)
> > >
> > > I saw it and I will craft something around it, it seems like a relevant
> > > remark.
> > >
> > >
> > > jean
> > >
> > > On Thu, Jan 15, 2026 at 14:23, Benoit TELLIER <[email protected]> wrote:
> > >
> > >> Hi all,
> > >>
> > >>
> > >>
> > >>
> > >> I would like to call a vote on the following change:
> > >>
> > >>   - github.com/apache/james-project/pull/2894 ADR: Deleted message
> > >> vault single bucket usage
> > >>
> > >>   - github.com/apache/james-project/pull/2902 which is the
> > >> implementation of the aforementioned ADR
> > >>
> > >>
> > >> The use of monthly buckets is problematic with certain S3 suppliers
> > >> that limit their count, and requires extensive rights onto the object
> > >> store endpoints.
> > >> The proposal would address solely this problematic feature and not
> > >> attempt to refactor in depth the way James' buckets are encoded onto
> > >> the S3 endpoint.
> > >>
> > >> (More context is provided onto the relevant ADR)
> > >>
> > >>
> > >>
> > >> This vote is open for at least 72 hours and requires a simple
> > >> majority.
> > >>
> > >>
> > >> Please vote:
> > >>
> > >>
> > >>   - +1 approve
> > >>
> > >>
> > >>   - 0 no opinion
> > >>
> > >>
> > >>   - -1 disapprove (please explain)
> > >>
> > >>
> > >>
> > >> Thanks,
> > >>
> > >> --
> > >>
> > >> Best regards,
> > >>
> > >> Benoit TELLIER
> > >>
> > >> General manager of Linagora VIETNAM.
> > >> Product owner for Twake-Mail product.
> > >> Chairman of the Apache James project.
> > >>
> > >> Mail: [email protected]
> > >> Tel: (0033) 6 77 26 04 58 (WhatsApp, Signal)
> > >>
> > >>
> > >>
> > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >
>
