Re: Distributed James: make ElasticSearch indexing optional?

Matthieu Baechler Fri, 12 Jun 2020 01:06:10 -0700

Hi Raphael,

My answers are inline below.

On Thu, 2020-06-11 at 18:01 +0200, Raphaël Ouazana-Sustowski wrote:
> Hi,
> 
> Here is a proposal to make ElasticSearch optional in our distributed 
> product/flavor/server.
> 
> Comments are welcome.
> 
> 
> ## Why?
> 
> Some people have expressed the need of using a distributed James
> without 
> ElasticSearch:
> - in some comment here: 
> https://issues.apache.org/jira/browse/JAMES-3086

I read that people say they are "not using search". I'm very curious
about that: what does it mean to run a mail server with IMAP and/or
JMAP without using any search?

As far as I know, the main IMAP RFC (RFC 3501) requires some search
support: SEARCH is part of the base protocol.

Are they removing the `SearchProcessor` from the IMAP server and
returning errors to their clients? Do they expect that no user will
ever hit the search button of their MUA?

They complain about ElasticSearch indexing being slow (or, one could
also say, expensive): wait until they do a full-scan search of a user's
inbox (:

I'm OK with solutions that offer a different "upfront indexing
cost"/"search performance" trade-off, but not with proposing a
distributed server that relies on a full scan of Cassandra for every
incoming search.

We have to be open to custom usages of James and make it possible for
developers to remove some features they don't need. But I'm not
convinced a user should be able to do that with a configuration
option.

> - one of our customers plan to deploy a distributed James server for 
> serving POP3 encrypted emails. This deployment does not rely on 
> searching features. However as part of current Distributed James
> server 
> he is forced to rely on ElasticSearch email indexing.
> 
> This results in wasted resources as maintaining an ElasticSearch
> cluster 
> to keep up with the volume is expensive.
> Maintaining an ElasticSearch cluster when not needed is costly at 
> several levels:
> - cost of infrastructure to deploy it
> - cost of people having to maintain it
> - performance cost on James to unnecessarily index data

Meanwhile they also pay the price for IMAP and JMAP data indexing in
mailbox code (generating ids that are never consumed but still use
Cassandra LWT, same for modseq, projections in various tables that are
never read, etc.).

And while they can easily disable such protocols, they will still have
an IMAP/JMAP server (with the associated cost) serving only POP3.

Does disabling only ES in that context make sense at all for the
Distributed James *product*?

Shouldn't we craft a specific Distributed SMTP+POP product instead that
would remove all the waste?

> ## How ?
> 
> Scanning search is a search implementation that is running on top of
> any 
> mailbox implementation, even distributed ones and does not require
> to 
> index data.
> 
> Scanning Search is tested both at the component level (unit test) 

With 38 disabled test cases.

> but 
> also passes IMAP (MPT) tests on top of Cassandra implementation,
>  as well 
> as JMAP memory tests, thus delivers correct results. Of course it
> does 
> not support full text search.
> 
> We should allow Distributed James to optionally rely on scanning
> search 
> instead of ElasticSearch.
> 
>   - Scanning search should be advised for deployments rarely
> searching data
>   - ElasticSearch should be advised when search is frequent or
> requires 
> high performance
> 
> We could use module choosing [1] to choose between scanning search
> and 
> ElasticSearch.
> 
> To be noted that scanning search introduces no other dependencies as
> it 
> is part of mailbox-store thus causes no risk of library clashes.
> 
> To be noted also that metric collection and log collection using 
> ElasticSearch is unaffected.
> 

You didn't mention what will happen when a search is issued: we are
probably going to read the full mails of the searched mailbox, or even
of all the user's mailboxes in case of a multi-mailbox search, to find
relevant emails.

For a user with 10 GiB of emails, it will surely time out and will
probably bring the whole cluster to its knees.
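
To give a rough idea of the order of magnitude, here is a minimal
back-of-envelope sketch in Java. The 50 MiB/s read throughput is purely
an assumption for illustration, not a measurement of any Cassandra
cluster, and `ScanCostEstimate` and `scanSeconds` are hypothetical
names that do not exist in James.

```java
// Hypothetical helper, not part of James: estimates how long a full scan
// of a user's mail would take at a given sustained read throughput.
class ScanCostEstimate {

    // Seconds needed to read `bytes` at `bytesPerSecond`.
    static double scanSeconds(long bytes, long bytesPerSecond) {
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("throughput must be positive");
        }
        return (double) bytes / bytesPerSecond;
    }

    public static void main(String[] args) {
        long tenGiB = 10L * 1024 * 1024 * 1024;
        // ASSUMPTION: 50 MiB/s sustained read for one search, not a measured value.
        long assumedThroughput = 50L * 1024 * 1024;
        System.out.println(scanSeconds(tenGiB, assumedThroughput) + " seconds");
    }
}
```

Under that assumption a single search needs about 205 seconds just to
read the data, before any parsing or matching, and every concurrent
search adds the same read load again.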

I don't find the scanning search relevant.

> ## Alternative
> 
> The alternative would be to build a different product/flavor/server
> than 
> the distributed one, where the only difference with the distributed
> one 
> is that indexing will rely on scanning instead of ElasticSearch.
> 
> The maintenance cost of such a product/flavor/server is higher than
> of a 
> configuration option (Docker images to release, time and energy to
> run 
> integration tests on it).
> 
> Such a product/flavor is hard to brand because even if it answers a 
> need, it is not so far of the distributed one, and does not answer
> needs 
> that are very far from it neither.
> 
> The advantage is that is would allow to more fine tune this solution
> to 
> answer to the exact needs.
> 

Another alternative would be:

* implement an SMTP+POP3 product where we can progressively remove the
unneeded parts as we did for the SMTP-only product

* throttle the indexing to limit the impact of this process when
receiving a lot of mails (at the cost of a search index that lags
behind recent deliveries)

* be able to configure what is indexed (if we drop attachment indexing
and full-text indexing we'll probably be way faster)

These are just examples of what we could do; there are a lot of other
solutions.
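
To make the throttling idea a bit more concrete, here is a minimal
sketch in Java, assuming a bounded backlog that is drained at a fixed
number of messages per tick. `ThrottledIndexer`, `submit` and `tick`
are hypothetical names for illustration only; nothing here is an
existing James class, and a real implementation would have to persist
the backlog somewhere.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch, not a James class: delivery only enqueues message ids,
// and a periodic task indexes at most `maxPerTick` of them per invocation.
class ThrottledIndexer {
    private final BlockingQueue<String> pending;
    private final int maxPerTick;

    ThrottledIndexer(int capacity, int maxPerTick) {
        this.pending = new ArrayBlockingQueue<>(capacity);
        this.maxPerTick = maxPerTick;
    }

    // Called on mail delivery. Never blocks; returns false when the backlog is full.
    boolean submit(String messageId) {
        return pending.offer(messageId);
    }

    // Called periodically (for example by a scheduler). Returns how many ids were indexed.
    int tick(Consumer<String> indexer) {
        int done = 0;
        String id;
        while (done < maxPerTick && (id = pending.poll()) != null) {
            indexer.accept(id);
            done++;
        }
        return done;
    }

    // Number of messages delivered but not yet indexed.
    int backlog() {
        return pending.size();
    }
}
```

The backlog size is exactly the "not so up-to-date" cost mentioned
above: the search index lags by whatever is still queued, and what to
do when `submit` returns false is left open in this sketch.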

I'm convinced the two use cases are really different and that you put
them together because the solution to one problem happens to somehow
solve the other issue at the same time.

I propose to focus on the use case that is the most important right now
and to search for solutions regardless of other issues we may have.

Or at least discuss these issues in two threads.

What do you think?

-- Matthieu Baechler
