Re: Distributed James: make ElasticSearch indexing optional?

Raphaël Ouazana-Sustowski Mon, 15 Jun 2020 06:31:26 -0700

Hello,

On 15/06/2020 at 09:52, Matthieu Baechler wrote:

Hi Raphael,


On Fri, 2020-06-12 at 18:29 +0200, Raphaël Ouazana-Sustowski wrote:

Hello Matthieu,

On 12/06/2020 at 10:05, Matthieu Baechler wrote:

Hi Raphael,

My answers below

On Thu, 2020-06-11 at 18:01 +0200, Raphaël Ouazana-Sustowski wrote:

Hi,

Here is a proposal to make ElasticSearch optional in our distributed
product/flavor/server.

Comments are welcome.


## Why?

Some people have expressed the need to use a distributed James without
ElasticSearch:
- in a comment here: https://issues.apache.org/jira/browse/JAMES-3086

I read that people say they are "not using search". I'm very curious
about that: what does it mean to have a mail server with IMAP and/or
JMAP without using any search?

As far as I know, the main IMAP RFC requires some search support.

Are they removing the `SearchProcessor` from the IMAP server and
returning errors to their clients? Do they expect that no user will
ever hit the search button of their MUA?

I see many use cases where you would not need search, essentially based
on automatic mail processing, which is a common James workflow.

Does it still make sense to support IMAP at this point? I'm almost sure
people would expect REST and/or MQ in this case, don't you think?

Standard vs non-standard API? So yes, it can make sense. I won't go
further on this topic because, as you said, I don't know exactly the
need for such a workflow, so if people are interested please contribute
to this discussion.

They complain about ElasticSearch indexing being slow (or one could
also say expensive): wait until they do a full-scan search of a user's
inbox (:

I'm ok to have solutions for a different "upfront indexing
cost"/"search performance" ratio but not to propose a distributed
server relying on doing a full-scan of cassandra for every incoming
search.

We have to be open to custom usages of James and make it possible for
developers to remove some features they don't need. But I'm not
convinced a user should be able to do that with a configuration
option.

That's not only because of indexing being slow, it's also to get rid of
the whole ElasticSearch cluster.

The link you provided, to which I refer when talking about indexing
being slow, is about indexing being slow as far as I understand. Let's
deal with these two cases in different threads to avoid confusion.



We did not read the same sentence:

"We don't use elasticsearch, why is it not possible to remove it?"

Again, I cannot go further on this point, I'm not the user complaining
about the presence of ElasticSearch.

- one of our customers plans to deploy a distributed James server for
serving POP3 encrypted emails. This deployment does not rely on search
features. However, with the current Distributed James server, they are
forced to rely on ElasticSearch email indexing.

This results in wasted resources, as maintaining an ElasticSearch
cluster to keep up with the volume is expensive.
Maintaining an ElasticSearch cluster when not needed is costly at
several levels:
- cost of infrastructure to deploy it
- cost of people having to maintain it
- performance cost on James to unnecessarily index data

Meanwhile they also pay the price for IMAP and JMAP data indexing in
mailbox code (generating ids that are never consumed but using
Cassandra LWT, same for modseq, projections in various tables that are
never read, etc).

And while they can easily disable such protocols, they will still have
an IMAP/JMAP server (with associated cost) serving only POP3.

Does disabling only ES in that context make sense at all for the
Distributed James *product*?

Shouldn't we craft a specific Distributed SMTP+POP product instead,
one that would remove all the waste?

It makes sense because it allows us to easily switch from one
configuration to the other. Going back and forth between the scanning
implementation and the ES one is pretty easy.

As long as you don't have real users with mails. How long will a full
reindex (which is supposed to be slow according to user complaints)
take with some terabytes of emails? Is that what you call "easy"?
Because having a Distributed Mail Server without a huge amount of data
doesn't make much sense.

It depends, the Distributed Mail Server currently covers the use case of
high availability. So it can make sense outside of the big data world.


So, let's be realistic: this switch, while possible with some
configuration, would be quite hard to handle properly in the real world
(it requires at least some ops and active monitoring).

Having a new (potentially optimized) product could be great in some
cases, but would totally go against this.

Can we have arguments?

Bundling too many use cases in a single product is not very appealing
to me because I suspect it will become too complex by doing too many
different things, confusing to users because we'll have to explain
carefully in which case a specific option makes sense, and hard to
maintain because it's hard to make good choices when we can't figure
out who our users are, etc.

What's the difference between explaining a configuration option and
explaining which product to choose? From my point of view, one product
is comfortable. You know you have some configuration options that can
give you one behaviour or another. Several products force you to make
the right choices at the very beginning of the project, when you don't
yet know your requirements well enough to make them.

## How?

Scanning search is a search implementation that runs on top of any
mailbox implementation, even distributed ones, and does not require
indexing data.

Scanning Search is tested both at the component level (unit test)

With 38 disabled testcases

but also passes IMAP (MPT) tests on top of the Cassandra
implementation, as well as JMAP memory tests, and thus delivers correct
results. Of course it does not support full-text search.

We should allow Distributed James to optionally rely on scanning search
instead of ElasticSearch.

    - Scanning search should be advised for deployments rarely
      searching data
    - ElasticSearch should be advised when search is frequent or
      requires high performance
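
To make the trade-off between the two approaches concrete, here is a minimal sketch. All type names are hypothetical and none of them are James classes; the "mailbox" is just an in-memory map of uid to subject. The scanning variant does nothing on delivery and reads every message at search time, while the indexed variant pays on every delivery and answers searches with a lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical interface for illustration only, not the James search API.
interface SimpleSearch {
    void onMessageAdded(int uid, String subject);
    List<Integer> search(String word);
}

// Scanning: no work on delivery, every search walks all stored messages.
final class ScanningSketch implements SimpleSearch {
    private final Map<Integer, String> mailbox; // the stored messages themselves

    ScanningSketch(Map<Integer, String> mailbox) {
        this.mailbox = mailbox;
    }

    @Override
    public void onMessageAdded(int uid, String subject) {
        // Nothing to maintain: there is no index.
    }

    @Override
    public List<Integer> search(String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : mailbox.entrySet()) {
            String[] tokens = entry.getValue().toLowerCase(Locale.ROOT).split("\\s+");
            if (Arrays.asList(tokens).contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}

// Indexed: every delivery updates an inverted index, searches are lookups.
final class IndexedSketch implements SimpleSearch {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    @Override
    public void onMessageAdded(int uid, String subject) {
        for (String token : subject.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                index.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(uid);
            }
        }
    }

    @Override
    public List<Integer> search(String word) {
        Set<Integer> uids = index.getOrDefault(word.toLowerCase(Locale.ROOT), Set.of());
        return new ArrayList<>(uids);
    }
}
```

Both variants return the same uids for a single-word query; what differs is where the cost is paid, which is the point of the recommendation above.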

We could use module choosing [1] to choose between scanning search and
ElasticSearch.

Note that scanning search introduces no other dependencies, as it is
part of mailbox-store, and thus causes no risk of library clashes.

Note also that metric collection and log collection using
ElasticSearch are unaffected.

You didn't mention what will happen in the case of a search: we are
probably going to read full mails for the searched mailbox, or even for
a given user in the case of a multi-mailbox search, to find relevant
emails.

For a user with 10GiB of emails, it will surely time out and will
probably bring the whole cluster to its knees.

I don't find the scanning search relevant.

There is no case of search.

I don't understand that sentence: what are you describing? I mean, one
can send a SEARCH command to IMAP and it would be served by the
scanning search, right?

A given workflow could use the IMAP protocol without SRCH. It's what I
had in mind, but anyway those are suppositions about a workflow I don't
know, so I cannot tell much about it.

If there is one, the only thing to do is to add an ES cluster and
change the configuration option.

Why set up a scanning search if you don't want any search to
happen?! I'd rather use a NoopSearch that never finds anything.

Why not. In this case we can add a configuration option for the
Distributed James Server:

- ES search

- scanning search (expect low search performance except in some
  particular cases, so avoid searching)

- Noop search (expect no search result, or errors when searching)
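
As a rough illustration of what such a three-valued option could look like, here is a minimal sketch. Everything in it is an assumption made for the example: the `search.backend` property name, the enum, and the classes are hypothetical and are not existing James configuration keys or classes. Only the Noop variant is implemented; the other two are left out of scope.

```java
import java.util.List;
import java.util.Locale;
import java.util.Properties;

// Hypothetical enum: the three choices discussed above.
enum SearchBackendChoice {
    ELASTICSEARCH, SCANNING, NOOP;

    // Reads the hypothetical "search.backend" property.
    // Defaults to ElasticSearch so existing deployments keep their behaviour.
    static SearchBackendChoice fromProperties(Properties props) {
        String raw = props.getProperty("search.backend", "elasticsearch");
        return valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }
}

// Hypothetical minimal search contract, not the James search API.
interface SearchIndexSketch {
    List<Long> search(String mailboxName, String term);
}

// Noop search: never indexes anything and never finds anything.
final class NoopSearchSketch implements SearchIndexSketch {
    @Override
    public List<Long> search(String mailboxName, String term) {
        return List.of();
    }
}

final class SearchBackendWiring {
    // Picks the implementation for the configured choice.
    static SearchIndexSketch create(SearchBackendChoice choice) {
        switch (choice) {
            case NOOP:
                return new NoopSearchSketch();
            case SCANNING:
            case ELASTICSEARCH:
            default:
                // A real server would wire the actual implementations here.
                throw new UnsupportedOperationException(
                    choice + " is out of scope for this sketch");
        }
    }
}
```

An unknown value makes `valueOf` throw `IllegalArgumentException`, so a typo in the option fails at startup rather than silently falling back to another backend.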

## Alternative

The alternative would be to build a different product/flavor/server
than the distributed one, where the only difference from the
distributed one is that search would rely on scanning instead of
ElasticSearch.

The maintenance cost of such a product/flavor/server is higher than
that of a configuration option (Docker images to release, time and
energy to run integration tests on it).

Such a product/flavor is hard to brand because even if it answers a
need, it is not so far from the distributed one, and does not answer
needs that are very far from it either.

The advantage is that it would allow fine-tuning this solution to
answer the exact needs.

Another alternative would be:

* implement an SMTP+POP3 product where we can progressively remove the
unneeded parts, as we did for the SMTP-only product

* throttle the indexing to limit the impact of this process when
receiving a lot of mails (at the cost of having a search index that is
not so up-to-date)

* be able to configure what is indexed (if we drop attachment indexing
and full-text indexing we'll probably be way faster)
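
The last point, configuring what is indexed, could be sketched as follows. This is only an illustration under stated assumptions: the class, its flags, and the field names are hypothetical and do not correspond to existing James options. Headers (here just the subject) are always kept, while body and attachment text are only added to the indexed document when enabled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical policy deciding which parts of a message reach the index.
final class IndexingPolicySketch {
    private final boolean indexBody;
    private final boolean indexAttachments;

    IndexingPolicySketch(boolean indexBody, boolean indexAttachments) {
        this.indexBody = indexBody;
        this.indexAttachments = indexAttachments;
    }

    // Builds the document to send to the index; the subject is always kept.
    Map<String, String> toIndexedDocument(String subject, String body, String attachmentText) {
        Map<String, String> document = new LinkedHashMap<>();
        document.put("subject", subject);
        if (indexBody) {
            document.put("body", body);
        }
        if (indexAttachments) {
            document.put("attachments", attachmentText);
        }
        return document;
    }
}
```

With both flags off, only the small header fields are indexed, which is where the expected speed-up would come from.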

It's just an example of what we could do and there are a lot of other
solutions.

I'm convinced both use cases are really different and you put them
together because the solution to one problem happens to somehow solve
another issue at once.

I propose to focus on the use case that is the most important right now
and to search for solutions regardless of other issues we may have.

Or at least discuss these issues in two threads.

What do you think?

We have a simple solution allowing us to configure a specific use case
in a supported and widely used product. Is it better than a specific
solution in a product which is harder to define for users and is
probably not of enough interest to be really well maintained?

I would say it's a matter of opinion: what you find simple, I find
confusing, etc.

You also argued it's a common usage pattern to have James in this kind
of configuration so I guess people would be eager to maintain it?

Or are we proposing to maintain a feature that we don't expect to use
by ourselves and think is not so relevant? I would then propose to not
support that feature at all in this case.

To conclude with a proposal, I would say we should focus on the use
cases and then try to figure out what the consequences are.

In this case I would describe the use case as "As an enterprise IT
architect, I want to deploy James to handle mail interaction with
people from the outside in an internal domain application".

Not sure it covers the use case you have in mind so please comment with
your ideas.

For my part I have only one use case in mind, that of my customer. A
configuration option would solve it. A product would also (that's why I
put it in "Alternative"). The customer prefers the configuration
option, and so do I, for the reasons I explained.

The other arguments from other use cases are slightly related, and you
are right: as I don't know them exactly, I cannot tell if my solution
would fit their needs too.

Finally, the configuration option is already the subject of a pull
request, and it seems to be really simpler than having a new product
(in terms of quantity of code and impact on the deployment -- of course
simplicity is very subjective: for example Guice is simple for you, it
can be different for other people). If in the future a new product
makes more sense, reverting this PR and building a product around this
would not be too much of a burden. The other way around, going from 2
(potentially incompatible) products to only one would be way harder in
terms of the data migration we would have to implement.

That's also why I think the right choice for now is to add a
configuration option.


Cheers,

Raphaël.


