Hi Chris,

I think there is some confusion here. When people warn against relying on relevance scores, they usually mean comparing scores across different queries. Your situation is different, or at least it lends itself to working around that problem, at least partially.

You have N users, and each user saves a number of queries. You have an incoming stream of documents that you want to match against all users' saved queries. When a new document is matched you could:

1) send it to the user right away
2) store it somewhere as a document that matched a query Q, and send all matches to users periodically

If you go with 1) then either you send all matches to users, or you introduce absolute score thresholds, which is bad for the reason you already identified.

If you go with 2) then you have the option of batching up matches for each saved query and alerting users only every few hours. Because every match in a batch is for the same query, you are comparing scores within a query rather than across queries. Then you could introduce logic that says:

"If there are >N matches for query Q, then remove all matches with score <S"
"If there are >M matches for query Q, then remove all matches with score <R"
"If there are <Z matches for query Q, then keep all matches"
...

Maybe you can turn this into a feature in your product ;)

Otis
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm

>________________________________
> From: Chris Harris <rygu...@gmail.com>
>To: solr-user@lucene.apache.org
>Sent: Wednesday, May 9, 2012 4:50 AM
>Subject: Can one determine which results are "good enough" to alert users
>about?
>
>I'm trying to think through a Solr-based email alerting engine that
>would have the following properties:
>
>1. Users can enter queries they want to be alerted on, and the syntax
>for alert queries should be the same syntax as my regular solr
>(dismax) queries.
>
>1a. Corollary: Because of not just tf-idf but also dismax pf and qf
>boosting, this implies that the set of documents that match a given
>query will vary widely in quality; the first page of search results
>will be quite good, but the last page won't be worth looking at.
>
>2. The email alerting engine shouldn't bother alerting people about
>*all* new results for a given query; in particular it should avoid the
>poor-quality tail of results and just alert on "the good stuff".
>
>Unfortunately, my current understanding of Solr/Lucene is that there's
>not a good automatic way to partition the set of query results into
>"good stuff" vs "not good stuff". The main option I know of is to
>filter out documents below a certain score threshold, but if you
>search the Lucene/Solr mailing lists, people will advise that this is
>unlikely to be fruitful. (It ultimately boils down to how Lucene/Solr
>scores weren't especially designed to mean anything as absolute
>numbers, only when compared to other scores.)
>
>This makes me wonder if there's something wrong with my original
>requirements, or whether people have thought of some other way to
>approach this.
>
>Interestingly, Google appears to have solved this at least to some
>degree with Google Alerts (http://www.google.com/alerts); there you
>can choose to receive "Only the best results" rather than "All the
>results". I'm not clear how they determine which results are "best",
>but their UI certainly implies they've come up with some scheme for
>it.
>
>Thanks,
>Chris
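
For what it's worth, the batching rules from option 2) in the reply could be sketched roughly as below. This is only an illustration, not anything Solr provides: `Match`, `trim_batch`, and the `(min_count, min_score)` rule pairs are hypothetical names, and the counts and score cutoffs in the usage lines are made up. It assumes each batch holds the matches collected for one saved query since the last alert.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """One document that matched a saved query (hypothetical structure)."""
    doc_id: str
    score: float


def trim_batch(matches, rules):
    """Apply count-dependent score cutoffs to one saved query's batch.

    rules is a list of (min_count, min_score) pairs meaning
    "if there are > min_count matches, remove all matches with
    score < min_score". The rule with the largest min_count that the
    batch exceeds wins; if no rule applies, all matches are kept.
    """
    n = len(matches)
    # Check the most demanding count first so big batches get the
    # strictest cutoff.
    for min_count, min_score in sorted(rules, reverse=True):
        if n > min_count:
            return [m for m in matches if m.score >= min_score]
    return list(matches)


# Illustrative values only: >2 matches drops score < 2.0,
# >1 match drops score < 1.0, otherwise keep everything.
rules = [(2, 2.0), (1, 1.0)]
batch = [Match("a", 0.5), Match("b", 1.5), Match("c", 2.5)]
kept = trim_batch(batch, rules)
```

The cutoffs here are still raw scores, as in the rules quoted above, so they would need tuning per deployment; the point is only that the cutoff depends on how many matches the query collected in the batch.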