Re: Distributed James: make ElasticSearch indexing optional?

Matthieu Baechler Mon, 15 Jun 2020 00:53:03 -0700

Hi Raphael,

On Fri, 2020-06-12 at 18:29 +0200, Raphaël Ouazana-Sustowski wrote:
> Hello Matthieu,
> 
> On 12/06/2020 at 10:05, Matthieu Baechler wrote:
> > Hi Raphael,
> > 
> > My answers below
> > 
> > On Thu, 2020-06-11 at 18:01 +0200, Raphaël Ouazana-Sustowski wrote:
> > > Hi,
> > > 
> > > Here is a proposal to make ElasticSearch optional in our
> > > distributed
> > > product/flavor/server.
> > > 
> > > Comments are welcome.
> > > 
> > > 
> > > ## Why?
> > > 
> > > Some people have expressed the need to use a distributed James
> > > without ElasticSearch:
> > > - in some comment here:
> > > https://issues.apache.org/jira/browse/JAMES-3086
> > I read that people say they are "not using search". I'm very curious
> > about that: what does it mean to have a mail server with IMAP
> > and/or JMAP without using any search?
> > 
> > As far as I know, the main IMAP RFC requires some search support.
> > 
> > Are they removing the `SearchProcessor` from the IMAP server and
> > returning errors to their clients? Do they expect that no user will
> > ever hit the search button of their MUA?
> 
> I see many use cases where you would not need search, essentially
> based 
> on automatic mail processing, which is a common James workflow.


Does it still make sense to support IMAP at this point? I'm almost sure
people would expect REST and/or MQ in this case, don't you think?

> 
> > They complain about ElasticSearch indexing being slow (or one could
> > also say expensive): wait until they do a full-scan search of users
> > inbox (:
> > 
> > I'm ok to have solutions for a different "upfront indexing
> > cost"/"search performance" ratio but not to propose a distributed
> > server relying on doing a full-scan of cassandra for every incoming
> > search.
> > 
> > We have to be open to custom usages of James and make it possible
> > for developers to remove some features they don't need. But I'm not
> > convinced a user should be able to do that with a configuration
> > option.
> 
> That's not only because of indexing being slow, it's also to get rid
> of 
> the whole ElasticSearch cluster.

As far as I understand, the link you provided, which is what I was
referring to, is only about indexing being slow. Let's deal with these
two cases (slow indexing, and getting rid of the ES cluster) in
different threads to avoid confusion.

> 
> > > - one of our customers plans to deploy a distributed James server
> > > for serving POP3 encrypted emails. This deployment does not rely on
> > > searching features. However, with the current Distributed James
> > > server, they are forced to rely on ElasticSearch email indexing.
> > > 
> > > This results in wasted resources as maintaining an ElasticSearch
> > > cluster
> > > to keep up with the volume is expensive.
> > > Maintaining an ElasticSearch cluster when not needed is costly at
> > > several levels:
> > > - cost of infrastructure to deploy it
> > > - cost of people having to maintain it
> > > - performance cost on James to unnecessarily index data
> > Meanwhile they also pay the price for IMAP and JMAP data indexing
> > in
> > mailbox code (generating ids that are never consumed but using
> > cassandra LWT, same for modseq, projections in various tables that
> > are
> > never read, etc).
> > 
> > And while they can easily disable such protocols they will still
> > have an IMAP/JMAP server (with associated cost) serving only POP3.
> > 
> > Does disabling only ES in that context make sense at all for the
> > Distributed James *product*?
> > 
> > Shouldn't we craft a specific Distributed SMTP+POP product instead
> > that would remove all the waste?
> 
> It makes sense because it makes it easy to go back from one
> configuration to the other. Going back and forth between the scanning
> implementation and the ES one is pretty easy.

As long as you don't have real users with mails. How long will a full
reindex (which is supposed to be slow according to user complaints)
take with some terabytes of emails? Is that what you call "easy"?
Because having a Distributed Mail Server without a huge amount of data
doesn't make much sense.

So, let's be realistic: this switch, while possible with some
configuration, would be quite hard to handle properly in the real
world (it requires at least some ops work and active monitoring).
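
To give an order of magnitude, here is a back-of-envelope sketch. Both
the mail volume and the indexing rate below are assumptions I made up
for illustration, not measurements of James, Cassandra or
ElasticSearch, and `ReindexEstimate` is a hypothetical helper, not
something that exists in the code base.

```java
import java.time.Duration;

// Back-of-envelope estimate only: every figure passed in is an assumption,
// not a measurement of James, Cassandra or ElasticSearch.
final class ReindexEstimate {

    /** Time needed to re-read and re-index totalBytes at a sustained rate. */
    static Duration estimate(long totalBytes, long bytesPerSecond) {
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("throughput must be positive");
        }
        return Duration.ofSeconds(totalBytes / bytesPerSecond);
    }

    public static void main(String[] args) {
        long fiveTiB = 5L * 1024 * 1024 * 1024 * 1024; // hypothetical mail volume
        long tenMiBPerSecond = 10L * 1024 * 1024;      // hypothetical indexing rate
        Duration duration = estimate(fiveTiB, tenMiBPerSecond);
        System.out.println(duration.toHours() + " hours");
    }
}
```

With those made-up inputs the reindex takes around six days, during
which search results are incomplete. Real numbers may be better or
worse, but the point is that the duration scales with the stored
volume.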

> 
> Having a new (potentially optimized) product could be great in some 
> cases, but would totally go against this.

Can you give some arguments?

Bundling too many use cases in a single product is not very appealing
to me. I suspect it will become too complex by doing too many different
things, confusing to users because we'll have to explain carefully in
which case a specific option makes sense, and hard to maintain because
it's hard to make good choices when we can't figure out who our users
are.

> 
> > > ## How ?
> > > 
> > > Scanning search is a search implementation that is running on top
> > > of
> > > any
> > > mailbox implementation, even distributed ones and does not
> > > require
> > > to
> > > index data.
> > > 
> > > Scanning Search is tested both at the component level (unit test)
> > With 38 disabled testcases
> > 
> > > but
> > > also passes IMAP (MPT) tests on top of Cassandra implementation,
> > >   as well
> > > as JMAP memory tests, thus delivers correct results. Of course it
> > > does
> > > not support full text search.
> > > 
> > > We should allow Distributed James to optionally rely on scanning
> > > search
> > > instead of ElasticSearch.
> > > 
> > >    - Scanning search should be advised for deployments rarely
> > > searching data
> > >    - ElasticSearch should be advised when search is frequent or
> > > requires
> > > high performance
> > > 
> > > We could use module choosing [1] to choose between scanning
> > > search
> > > and
> > > ElasticSearch.
> > > 
> > > To be noted that scanning search introduces no other dependencies
> > > as
> > > it
> > > is part of mailbox-store thus causes no risk of library clashes.
> > > 
> > > To be noted also that metric collection and log collection using
> > > ElasticSearch is unaffected.
> > > 
> > You didn't mention what will happen in the case of a search: we are
> > probably going to read full mails for the searched mailbox, or even
> > for a given user in the case of a multi-mailbox search, to find
> > relevant emails.
> > 
> > For a user with 10GiB of emails, it will for sure time out and will
> > probably bring the whole cluster to its knees.
> > 
> > I don't find the scanning search relevant.
> 
> There is no case of search.

I don't understand that sentence: what are you describing? I mean, one
can send a SEARCH command to IMAP and it would be served by the
scanning search, right?

>  If there is one, the only thing to do is to 
> add an ES cluster and change the configuration option.

Why set up a scanning search if you don't want any search to
happen?! I'd rather use a NoopSearch that never finds anything.
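
Something along these lines is what I have in mind. It is only a
sketch: `SimpleSearchIndex` is a made-up, simplified contract for
illustration, not the actual James search index API, and `NoopSearch`
does not exist in the code base today.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified contract for illustration only; the actual
// search index interface in James has a richer signature.
interface SimpleSearchIndex {
    void index(String mailbox, String messageId, String body);

    List<String> search(String mailbox, String query);
}

// Search backend that indexes nothing and matches nothing, so a stray
// SEARCH costs no mail reads instead of triggering a full scan.
final class NoopSearch implements SimpleSearchIndex {
    @Override
    public void index(String mailbox, String messageId, String body) {
        // Intentionally empty: no indexing cost on mail delivery.
    }

    @Override
    public List<String> search(String mailbox, String query) {
        return Collections.emptyList();
    }
}

final class NoopSearchDemo {
    public static void main(String[] args) {
        SimpleSearchIndex search = new NoopSearch();
        search.index("INBOX", "msg-1", "hello world");
        System.out.println(search.search("INBOX", "hello").size());
    }
}
```

Compared to the scanning search, a stray SEARCH here is answered
without reading a single mail, which is the behaviour I would expect
from a deployment that claims not to need search.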

> 
> > > ## Alternative
> > > 
> > > The alternative would be to build a different
> > > product/flavor/server
> > > than
> > > the distributed one, where the only difference with the
> > > distributed
> > > one
> > > is that indexing will rely on scanning instead of ElasticSearch.
> > > 
> > > The maintenance cost of such a product/flavor/server is higher
> > > than
> > > of a
> > > configuration option (Docker images to release, time and energy
> > > to
> > > run
> > > integration tests on it).
> > > 
> > > Such a product/flavor is hard to brand because even if it answers a
> > > need, it is not so far from the distributed one, and does not answer
> > > needs that are very far from it either.
> > > 
> > > The advantage is that it would allow us to fine-tune this solution
> > > to answer the exact needs.
> > > 
> > 
> > Another alternative would be:
> > 
> > * implement a SMTP+POP3 product where we can progressively remove
> > the
> > unneeded parts as we did for the SMTP-only product
> > 
> > * throttle the indexing to limit the impact of this process when
> > receiving a lot of mails (at the cost of having a search index not
> > so
> > up-to-date)
> > 
> > * be able to configure what is indexed (if we drop attachment
> > indexing
> > and full-text indexing we'll probably be way faster)
> > 
> > It's just an example of what we could do and there are a lot of
> > other
> > solutions.
> > 
> > I'm convinced both use cases are really different and you put them
> > together because the solution to one problem happens to somehow
> > solve another issue at once.
> > 
> > I propose to focus on the use case that is the most important right
> > now
> > and to search for solutions regardless of other issues we may have.
> > 
> > Or at least discuss these issues in two threads.
> > 
> > What do you think?
> > 
> 
> We have a simple solution allowing us to configure a specific use
> case in a supported and widely used product. Is it better than a
> specific solution in a product which is harder to define for users
> and is probably not of enough interest to be really well maintained?
> 

I would say it's a matter of opinion: what you find simple, I find
confusing, etc.

You also argued it's a common usage pattern to have James in this kind
of configuration so I guess people would be eager to maintain it?

Or are we proposing to maintain a feature that we don't expect to use
ourselves and don't think is very relevant? In that case I would
propose not supporting that feature at all.

To conclude with a proposal, I would say we should focus on the use
cases and then try to figure out what the consequences are.

In this case I would describe the use case as "As an enterprise IT
architect, I want to deploy James to handle mail interaction with
people from the outside in an internal domain application".

Not sure it covers the use case you have in mind so please comment with
your ideas.

Cheers,

-- Matthieu Baechler



