Re: Distributed James: make ElasticSearch indexing optional?

Raphaël Ouazana-Sustowski Fri, 12 Jun 2020 09:30:04 -0700

Hello Matthieu,

Le 12/06/2020 à 10:05, Matthieu Baechler a écrit :

Hi Raphael,


My answers below

On Thu, 2020-06-11 at 18:01 +0200, Raphaël Ouazana-Sustowski wrote:

Hi,

Here is a proposal to make ElasticSearch optional in our distributed
product/flavor/server.

Comments are welcome.


## Why?

Some people have expressed the need to use a distributed James
without
ElasticSearch:
- in some comment here:
https://issues.apache.org/jira/browse/JAMES-3086

I read that people say they are "not using search". I'm very curious
about that: what does it mean to have a mail server with IMAP
and/or JMAP without using any search?

As far as I know, the main IMAP RFC requires some search support.

Are they removing the `SearchProcessor` from the IMAP server and returning
errors to their clients? Do they expect that no user will ever hit the
search button of their MUA?

I see many use cases where you would not need search, essentially based
on automatic mail processing, which is a common James workflow.


They complain about ElasticSearch indexing being slow (or one could
also say expensive): wait until they do a full-scan search of a user's
inbox (:

I'm ok to have solutions for a different "upfront indexing
cost"/"search performance" ratio but not to propose a distributed
server relying on doing a full-scan of cassandra for every incoming
search.

We have to be open to custom usages of James and make it possible for
developers to remove some features they don't need. But I'm not
convinced a user should be able to do that with a configuration
option.

That's not only because of indexing being slow, it's also to get rid of
the whole ElasticSearch cluster.

- one of our customers plans to deploy a distributed James server for
serving POP3 encrypted emails. This deployment does not rely on
searching features. However, as part of the current Distributed James
server
he is forced to rely on ElasticSearch email indexing.

This results in wasted resources as maintaining an ElasticSearch
cluster
to keep up with the volume is expensive.
Maintaining an ElasticSearch cluster when not needed is costly at
several levels:
- cost of infrastructure to deploy it
- cost of people having to maintain it
- performance cost on James to unnecessarily index data

Meanwhile they also pay the price for IMAP and JMAP data indexing in
mailbox code (generating ids that are never consumed but using
cassandra LWT, same for modseq, projections in various tables that are
never read, etc).

And while they can easily disable such protocols they will still have an
IMAP/JMAP server (with associated cost) serving only POP3.

Does disabling only ES in that context make sense at all for the
Distributed James *product*?

Shouldn't we craft a specific Distributed SMTP+POP product instead that
would remove all wastes?

It makes sense because it makes it easy to go back from one
configuration to the other. Going back and forth between the scanning
implementation and the ES one is pretty easy.

Having a new (potentially optimized) product could be great in some
cases, but would totally go against this.

## How?

Scanning search is a search implementation that is running on top of
any
mailbox implementation, even distributed ones and does not require
to
index data.

Scanning Search is tested both at the component level (unit test)

With 38 disabled test cases

but
also passes IMAP (MPT) tests on top of Cassandra implementation,
  as well
as JMAP memory tests, thus delivers correct results. Of course it
does
not support full text search.

We should allow Distributed James to optionally rely on scanning
search
instead of ElasticSearch.

   - Scanning search should be advised for deployments rarely
searching data
   - ElasticSearch should be advised when search is frequent or
requires
high performance

We could use module choosing [1] to choose between scanning search
and
ElasticSearch.
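
As an illustration of what such a choice could look like, here is a
minimal sketch. It is written in Python for brevity and is not James
code: the `search.implementation` key, the class names and the small
in-memory inverted index standing in for ElasticSearch are all
assumptions made for the example. The module choosing mentioned above
would play the role of `choose_search_index`.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Message:
    uid: int
    subject: str


class SearchIndex(Protocol):
    def add(self, message: Message) -> None: ...
    def search(self, term: str) -> List[int]: ...


class ScanningSearchIndex:
    """Indexes nothing on delivery; every search reads the whole mailbox."""

    def __init__(self, mailbox: List[Message]) -> None:
        self._mailbox = mailbox

    def add(self, message: Message) -> None:
        pass  # no upfront indexing cost

    def search(self, term: str) -> List[int]:
        needle = term.lower()
        return [m.uid for m in self._mailbox
                if needle in m.subject.lower().split()]


class InvertedSearchIndex:
    """Hypothetical stand-in for an external index such as ElasticSearch."""

    def __init__(self, mailbox: List[Message]) -> None:
        self._postings: Dict[str, List[int]] = {}

    def add(self, message: Message) -> None:
        # Pay the indexing cost on delivery so that searches are cheap.
        for word in set(message.subject.lower().split()):
            self._postings.setdefault(word, []).append(message.uid)

    def search(self, term: str) -> List[int]:
        return list(self._postings.get(term.lower(), []))


# Assumed option values; the real names would be decided by the project.
BACKENDS = {
    "scanning": ScanningSearchIndex,
    "elasticsearch": InvertedSearchIndex,
}


def choose_search_index(configuration: Dict[str, str],
                        mailbox: List[Message]) -> SearchIndex:
    """Pick the search backend from a (hypothetical) configuration key."""
    name = configuration.get("search.implementation", "elasticsearch")
    if name not in BACKENDS:
        raise ValueError(f"unknown search implementation: {name}")
    return BACKENDS[name](mailbox)
```

Both backends answer the same query interface, so switching from one to
the other only changes the configuration value, not the calling code.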

To be noted that scanning search introduces no other dependencies as
it
is part of mailbox-store thus causes no risk of library clashes.

To be noted also that metric collection and log collection using
ElasticSearch is unaffected.

You didn't mention what will happen in the case of a search: we are
probably going to read full mails for the searched mailbox or even for
a given user in case of a multi-mailbox search to find relevant
emails.

For a user with 10GiB of emails, it will for sure time out and will
probably bring the whole cluster to its knees.

I don't find the scanning search relevant.

There is no search in this use case. If one ever appears, the only thing
to do is to add an ES cluster and change the configuration option.
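
For scale, the 10GiB scenario raised above can be put into rough
numbers. The sketch below only does the arithmetic; the read throughput
is an assumed figure chosen for illustration, not a measurement of
Cassandra or James.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def full_scan_seconds(mailbox_bytes: int, read_bytes_per_second: int) -> float:
    """Lower bound on the time needed to read every byte of a mailbox once."""
    if read_bytes_per_second <= 0:
        raise ValueError("read throughput must be positive")
    return mailbox_bytes / read_bytes_per_second


# Assumption: 100 MiB/s sustained read throughput for one search.
estimate = full_scan_seconds(10 * GIB, 100 * MIB)
print(f"full scan of 10 GiB: about {estimate:.0f} s")
```

Under that assumption a single search reads for well over a minute,
which is why scanning only fits deployments where searches essentially
never happen.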

## Alternative

The alternative would be to build a different product/flavor/server
than
the distributed one, where the only difference with the distributed
one
is that indexing will rely on scanning instead of ElasticSearch.

The maintenance cost of such a product/flavor/server is higher than
of a
configuration option (Docker images to release, time and energy to
run
integration tests on it).

Such a product/flavor is hard to brand because even if it answers a
need, it is not so far from the distributed one, and does not answer
needs
that are very far from it either.

The advantage is that it would allow finer tuning of this solution
to
answer the exact needs.


Another alternative would be:

* implement a SMTP+POP3 product where we can progressively remove the
unneeded parts as we did for the SMTP-only product

* throttle the indexing to limit the impact of this process when
receiving a lot of mails (at the cost of having a search index not so
up-to-date)

* be able to configure what is indexed (if we drop attachment indexing
and full-text indexing we'll probably be way faster)

It's just an example of what we could do and there are a lot of other
solutions.

I'm convinced both use cases are really different and you put them
together because the solution to one problem happens to somehow solve
another issue at once.

I propose to focus on the use case that is the most important right now
and to search for solutions regardless of other issues we may have.

Or at least discuss these issues in two threads.

What do you think?

We have a simple solution allowing to configure a specific use case in a
supported and widely used product. Is it better than a specific solution
in a product which is harder to define for users and is probably not
of enough interest to be really well maintained?


Cheers,

Raphaël.


