Hi Dan,
Yuval's suggestion and mine both rely on the same underlying code (Luwak,
now called Lucene Monitor). This lets you store a set of Lucene queries
and run them against every new document.
The Lucene Monitor allows for very high-performance matching (I know of
situations with around 1m stored queries monitoring 1m new documents a
day, running on a few tens of nodes), and it does this with some clever
optimisations: effectively it builds an index of your stored queries,
and turns each new document into a query across this index (I know it
sounds confusing!). It's a 'reverse search'. Check out the original
Luwak project, as it has links to several presentations and blogs
showing how others have implemented these systems.
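To make the 'reverse search' idea concrete, here's a minimal sketch of
the Lucene Monitor API (the classes are from the org.apache.lucene.monitor
module; the alert id, field name and query below are made up for
illustration):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.monitor.*;
    import org.apache.lucene.search.TermQuery;

    public class MonitorSketch {
      public static void main(String[] args) throws Exception {
        // The Monitor keeps the stored queries in its own internal query index
        try (Monitor monitor = new Monitor(new StandardAnalyzer())) {
          // Register one stored query per alert; the id maps matches back to users
          monitor.register(new MonitorQuery("alert-42",
              new TermQuery(new Term("body", "solr"))));

          // Each incoming document is matched against all stored queries at once
          Document doc = new Document();
          doc.add(new TextField("body", "New Apache Solr release announced",
              Field.Store.NO));
          MatchingQueries<QueryMatch> matches =
              monitor.match(doc, QueryMatch.SIMPLE_MATCHER);

          for (QueryMatch match : matches.getMatches()) {
            System.out.println("Matched stored query: " + match.getQueryId());
          }
        }
      }
    }

The pre-filtering against the query index happens inside match(), which
is what keeps this fast even with very large numbers of stored queries.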
The bit you'll have to build is the Solr layer, and then the code that
uses this to generate alerts. Solcolator and
https://github.com/o19s/solr-monitor are two examples of how to do the
first part, which you can build on. The facility to do a reverse search
is not built into Solr yet, unlike Elasticsearch's Percolator.
Best
Charlie
On 07/09/2021 10:24, Dan Rosher wrote:
Thanks Eric, Charlie and Yuval for all the feedback and suggestions.
Eric: Yes, I thought the monitoring might be a bit of a pain, especially
with millions of them. I'll have to check out the topic code, but I
wondered if, rather than checking each daemon individually, I could look
at the checkpoint collections for uniqueIds that haven't been updated
for a 'while', which might suggest the daemon had stopped or died?
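For what it's worth, that staleness check might look something like the
SolrJ sketch below. Big caveat: as far as I know the topic expression
doesn't store a timestamp in its checkpoint documents by default, so
this assumes you've added one yourself (e.g. via
solr.TimestampUpdateProcessorFactory on the checkpoint collection), and
both the collection name and the timestamp_dt field here are invented:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class StaleCheckpoints {
      public static void main(String[] args) throws Exception {
        // Invented collection name: point this at your checkpoint collection
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/email_alerts").build()) {
          SolrQuery q = new SolrQuery("*:*");
          // timestamp_dt is an assumed field, refreshed on every checkpoint
          // update (e.g. by solr.TimestampUpdateProcessorFactory)
          q.addFilterQuery("timestamp_dt:[* TO NOW-1HOUR]");
          q.setFields("id");
          q.setRows(1000);
          for (SolrDocument d : solr.query(q).getResults()) {
            System.out.println("Stale checkpoint (daemon dead?): "
                + d.getFieldValue("id"));
          }
        }
      }
    }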
I was also wondering whether it's possible, or a useful enhancement, for
the topic streaming expression to look at the replica index version (as
opposed to _version_) and skip queries where the replica index version
is the same as one we might store in the checkpoint collection? For
collections that update infrequently I think this might be useful.
Charlie: It was for email alerts, so a user stores a query for collection
docs to match against, and then the system emails matches to the user. Do
you think solr-monitor can be used for this purpose?
Yuval: I like the idea of using the UpdateProcessor - at least there's
no need for daemons or for monitoring them - but would this scale to
millions of email queries?
Many thanks again to all.
Kind regards,
Dan
On Mon, 6 Sept 2021 at 18:47, Yuval Paz <yuval.p...@mail.huji.ac.il> wrote:
My team and I are building on Solcolator:
https://github.com/SOLR4189/solcolator
Currently the processor is built for Solr 6.5.1. We are working on
updating our Solr, and I hope to release a complete version of our
Solcolator as open source then (it will be for version 8.6.x).
We made it an update processor: either it becomes the last element in
the chain, replacing the usual processor that indexes the document, or
it sits one from last in the chain, which also allows monitoring atomic
updates (this is relatively costly).
By making it an update processor we don't rely on the streaming daemon,
which we found unsatisfying, as we wish to allow users to define their
own monitors over the index.
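To illustrate the 'one from last' placement, the chain in solrconfig.xml
might look roughly like this - only a sketch, and the Solcolator factory
class name is hypothetical (check the project for the real one). Putting
it after DistributedUpdateProcessorFactory is what lets it see atomic
updates already resolved into full documents:

    <updateRequestProcessorChain name="solcolator" default="true">
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.DistributedUpdateProcessorFactory"/>
      <!-- hypothetical class name, for illustration only -->
      <processor class="com.solcolator.update.SolcolatorUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>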
On Mon, Sep 6, 2021, 8:25 PM Charlie Hull <ch...@opensourceconnections.com> wrote:
Are you trying to monitor a stream of emails for certain patterns? In
that case you might look at the Lucene Monitor
https://lucene.apache.org/core/8_2_0/monitor/index.html?overview-summary.html
(https://issues.apache.org/jira/browse/LUCENE-8766), which was originally
Luwak - at my previous company Flax we helped build several large-scale
monitoring systems with it: https://github.com/flaxsearch/luwak . It's
not officially surfaced in Solr yet, although my colleague Scott Stults
has been working on some ideas: https://github.com/o19s/solr-monitor
best
Charlie
On 06/09/2021 14:32, Dan Rosher wrote:
Hi,
I was wondering if anyone had tried email alerts with streaming
expressions, and what their experience was if attempting this with,
say, 12 million emails a day? Traditionally this might have been done
with a daily database cursor iterator.
I was thinking of something like the following pseudocode expression,
with 'kafka' as a custom push expression:
daemon(id="alertId",
       runInterval="1000",
       kafka(kafka_topic,
             alertId,
             topic(email_alerts,
                   doc_collection,
                   q="email query",
                   fl="id, title, abstract",
                   id="alertId",
                   initialCheckpoint=0)))
If you have done something like this, where would you typically run the
daemon - on replicas kept away from the replicas serving web queries?
Many thanks in advance for any advice / suggestions,
Dan
--
Charlie Hull - Managing Consultant at OpenSource Connections Limited
<www.o19s.com>
Founding member of The Search Network <https://thesearchnetwork.com/>
and co-author of Searching the Enterprise
<https://opensourceconnections.com/about-us/books-resources/>
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
Amtsgericht Charlottenburg | HRB 230712 B
Geschäftsführer: John M. Woodell | David E. Pugh
Finanzamt: Berlin Finanzamt für Körperschaften II