Hi Raphaël,

My answers below.

On Thu, 2020-06-11 at 18:01 +0200, Raphaël Ouazana-Sustowski wrote:
> Hi,
>
> Here is a proposal to make ElasticSearch optional in our distributed
> product/flavor/server.
>
> Comments are welcome.
>
>
> ## Why?
>
> Some people have expressed the need of using a distributed James
> without ElasticSearch:
>  - in some comment here:
> https://issues.apache.org/jira/browse/JAMES-3086

I read that people say they are "not using search". I'm very curious
about that: what does it mean to have a mail server with IMAP and/or
JMAP without using any search? As far as I know, the main IMAP RFC
requires some search support. Are they removing the `SearchProcessor`
from the IMAP server and returning errors to their clients? Do they
expect that no user will ever hit the search button of their MUA?

They complain about ElasticSearch indexing being slow (or one could
also say expensive): wait until they do a full-scan search of a user's
inbox (:

I'm OK with having solutions for a different "upfront indexing
cost"/"search performance" ratio, but not with proposing a distributed
server that relies on a full scan of Cassandra for every incoming
search.

We have to be open to custom usages of James and make it possible for
developers to remove some features they don't need. But I'm not
convinced a user should be able to do that with a configuration option.

>  - one of our customers plan to deploy a distributed James server for
> serving POP3 encrypted emails. This deployment does not rely on
> searching features. However as part of current Distributed James
> server he is forced to rely on ElasticSearch email indexing.
>
> This results in wasted resources as maintaining an ElasticSearch
> cluster to keep up with the volume is expensive.
>
> Maintaining an ElasticSearch cluster when not needed is costly at
> several levels:
>  - cost of infrastructure to deploy it
>  - cost of people having to maintain it
>  - performance cost on James to unnecessarily index data

Meanwhile they also pay the price for IMAP and JMAP data indexing in
the mailbox code (generating ids that are never consumed but that use
Cassandra LWT, same for modseq, projections in various tables that are
never read, etc.).

And while they can easily disable such protocols, they will still have
an IMAP/JMAP server (with the associated cost) serving only POP3.

Does disabling only ES in that context make sense at all for the
Distributed James *product*? Shouldn't we craft a specific Distributed
SMTP+POP product instead, one that would remove all the waste?

> ## How ?
>
> Scanning search is a search implementation that is running on top of
> any mailbox implementation, even distributed ones and does not
> require to index data.
>
> Scanning Search is tested both at the component level (unit test)

With 38 disabled test cases.

> but also passes IMAP (MPT) tests on top of Cassandra implementation,
> as well as JMAP memory tests, thus delivers correct results. Of
> course it does not support full text search.
>
> We should allow Distributed James to optionally rely on scanning
> search instead of ElasticSearch.
>
>  - Scanning search should be advised for deployments rarely
> searching data
>  - ElasticSearch should be advised when search is frequent or
> requires high performance
>
> We could use module choosing [1] to choose between scanning search
> and ElasticSearch.
>
> To be noted that scanning search introduces no other dependencies as
> it is part of mailbox-store thus causes no risk of library clashes.
>
> To be noted also that metric collection and log collection using
> ElasticSearch is unaffected.
>

You didn't mention what will happen in the case of a search: we are
probably going to read full mails for the searched mailbox, or even
for a given user in the case of a multi-mailbox search, to find the
relevant emails. For a user with 10GiB of emails, it will for sure time
out and will probably bring the whole cluster to its knees.

I don't find the scanning search relevant.

> ## Alternative
>
> The alternative would be to build a different product/flavor/server
> than the distributed one, where the only difference with the
> distributed one is that indexing will rely on scanning instead of
> ElasticSearch.
>
> The maintenance cost of such a product/flavor/server is higher than
> of a configuration option (Docker images to release, time and energy
> to run integration tests on it).
>
> Such a product/flavor is hard to brand because even if it answers a
> need, it is not so far of the distributed one, and does not answer
> needs that are very far from it neither.
>
> The advantage is that is would allow to more fine tune this solution
> to answer to the exact needs.
>

Another alternative would be:

* implement an SMTP+POP3 product where we can progressively remove the
  unneeded parts, as we did for the SMTP-only product
* throttle the indexing to limit the impact of this process when
  receiving a lot of mails (at the cost of having a search index that
  is not quite up to date)
* be able to configure what is indexed (if we drop attachment indexing
  and full-text indexing we'll probably be way faster)

It's just an example of what we could do, and there are a lot of other
solutions.

I'm convinced the two use cases are really different, and that you put
them together because the solution to one problem happens to somehow
solve the other at once.

I propose to focus on the use case that is the most important right
now and to search for solutions regardless of other issues we may
have. Or at least to discuss these issues in two threads.

What do you think?

-- 
Matthieu Baechler
