Hi Dan,
Yuval's suggestion and mine both rely on the same underlying code (Luwak,
now called Lucene Monitor). This lets you store a set of Lucene queries
and run them against every new document.
The Lucene Monitor allows for very high-performance matching (I know of
situations with around 1m stored queries monitoring 1m new documents a
day, running on a few tens of nodes), and it does this with some clever
optimisations: effectively it builds an index of your stored queries,
and turns each new document into a query across this index (I know it
sounds confusing!). It's a 'reverse search'. Check out the original
Luwak project, as it has links to several presentations and blogs
showing how others have implemented these systems.
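To make the 'reverse search' idea concrete, here's a minimal sketch of
the Lucene Monitor API (the classes are from the org.apache.lucene.monitor
module; the alert id, field name and query below are made up for
illustration):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.monitor.*;
    import org.apache.lucene.search.TermQuery;

    public class MonitorSketch {
      public static void main(String[] args) throws Exception {
        // The Monitor keeps the stored queries in its own internal query index
        try (Monitor monitor = new Monitor(new StandardAnalyzer())) {
          // Register one stored query per alert; the id maps matches back to users
          monitor.register(new MonitorQuery("alert-42",
              new TermQuery(new Term("body", "solr"))));

          // Each incoming document is matched against all stored queries at once
          Document doc = new Document();
          doc.add(new TextField("body", "New Apache Solr release announced",
              Field.Store.NO));
          MatchingQueries<QueryMatch> matches =
              monitor.match(doc, QueryMatch.SIMPLE_MATCHER);

          for (QueryMatch match : matches.getMatches()) {
            System.out.println("Matched stored query: " + match.getQueryId());
          }
        }
      }
    }

The pre-filtering against the query index happens inside match(), which
is what keeps this fast even with very large numbers of stored queries.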
The bit you'll have to build is the Solr layer, and then the code that
uses this to generate alerts. Solcolator and
https://github.com/o19s/solr-monitor are two examples of how to do the
first part, which you can build on. The facility to do a reverse search
is not built into Solr yet, unlike Elasticsearch's Percolator.
Best
Charlie
On 07/09/2021 10:24, Dan Rosher wrote:
Thanks Eric, Charlie and Yuval for all the feedback and suggestions.
Eric: Yes, I thought the monitoring might be a bit of a pain, especially
with millions of them. I'll have to check out the topic code, but I
wondered if, rather than checking each daemon individually, I could look
at the checkpoint collections for uniqueIds that haven't been updated
for a 'while', which might suggest the daemon had stopped or died?
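For what it's worth, that staleness check might look something like the
SolrJ sketch below. Big caveat: as far as I know the topic expression
doesn't store a timestamp in its checkpoint documents by default, so
this assumes you've added one yourself (e.g. via
solr.TimestampUpdateProcessorFactory on the checkpoint collection), and
both the collection name and the timestamp_dt field here are invented:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class StaleCheckpoints {
      public static void main(String[] args) throws Exception {
        // Invented collection name: point this at your checkpoint collection
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/email_alerts").build()) {
          SolrQuery q = new SolrQuery("*:*");
          // timestamp_dt is an assumed field, refreshed on every checkpoint
          // update (e.g. by solr.TimestampUpdateProcessorFactory)
          q.addFilterQuery("timestamp_dt:[* TO NOW-1HOUR]");
          q.setFields("id");
          q.setRows(1000);
          for (SolrDocument d : solr.query(q).getResults()) {
            System.out.println("Stale checkpoint (daemon dead?): "
                + d.getFieldValue("id"));
          }
        }
      }
    }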
I was also wondering whether it's possible, or a useful enhancement, for
the topic streaming expression to look at the replica index version (as
opposed to _version_) and skip queries where the replica index version
is the same as one we might store in the checkpoint collection? For
collections that update infrequently I think this might be useful.
Charlie: It was for email alerts, so a user stores a query for collection
docs to match against, and then the system emails matches to the user. Do
you think solr-monitor can be used for this purpose?
Yuval: I like the idea of using the UpdateProcessor - at least there's
no need for daemons or for monitoring them - but would this scale to
millions of email queries?
Many thanks again to all.
Kind regards,
Dan
On Mon, 6 Sept 2021 at 18:47, Yuval Paz <yuval.p...@mail.huji.ac.il> wrote:
My team and I are building on Solcolator:
https://github.com/SOLR4189/solcolator
Currently the processor is built for Solr 6.5.1. We are working on
updating our Solr, and I hope to release a complete version of our
Solcolator as open source then (it will be for version 8.6.x).
We made it an update processor: either it becomes the last element in
the chain, replacing the usual processor that indexes the document, or
it sits one from last in the chain, which also allows monitoring atomic
updates (this is relatively costly).
By making it an update processor we don't rely on the streaming daemon,
which we found unsatisfying, as we wish to allow users to define their
own monitors over the index.
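To illustrate the 'one from last' placement, the chain in solrconfig.xml
might look roughly like this - only a sketch, and the Solcolator factory
class name is hypothetical (check the project for the real one). Putting
it after DistributedUpdateProcessorFactory is what lets it see atomic
updates already resolved into full documents:

    <updateRequestProcessorChain name="solcolator" default="true">
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.DistributedUpdateProcessorFactory"/>
      <!-- hypothetical class name, for illustration only -->
      <processor class="com.solcolator.update.SolcolatorUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>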
On Mon, Sep 6, 2021, 8:25 PM Charlie Hull <ch...@opensourceconnections.com> wrote:
Are you trying to monitor a stream of emails for certain patterns? In
that case you might look at the Lucene Monitor
https://lucene.apache.org/core/8_2_0/monitor/index.html?overview-summary.html
(https://issues.apache.org/jira/browse/LUCENE-8766), which was originally
Luwak - at my previous company Flax we helped build several large-scale
monitoring systems with it: https://github.com/flaxsearch/luwak . It's
not officially surfaced in Solr yet, although my colleague Scott Stults
has been working on some ideas: https://github.com/o19s/solr-monitor
best
Charlie
On 06/09/2021 14:32, Dan Rosher wrote:
Hi,
I was wondering if anyone had tried email alerts with streaming
expressions, and what their experience was if attempting this with,
say, 12 million emails a day? Traditionally this might have been done
with a daily database cursor iterator.
I was thinking of something like the following pseudocode expression,
with 'kafka' as a custom push expression:
daemon(id="alertId",
       runInterval="1000",
       kafka(kafka_topic,
             alertId,
             topic(email_alerts,
                   doc_collection,
                   q="email query",
                   fl="id, title, abstract",
                   id="alertId",
                   initialCheckpoint=0)))
If you have done something like this, where would you typically run the
daemon - on replicas kept away from the replicas serving web queries?
Many thanks in advance for any advice / suggestions,
Dan
--
Charlie Hull - Managing Consultant at OpenSource Connections Limited
<www.o19s.com>
Founding member of The Search Network <https://thesearchnetwork.com/>
and co-author of Searching the Enterprise
<https://opensourceconnections.com/about-us/books-resources/>
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
Amtsgericht Charlottenburg | HRB 230712 B
Geschäftsführer: John M. Woodell | David E. Pugh
Finanzamt: Berlin Finanzamt für Körperschaften II