Re: Filtering results by minimum relevancy score
Hi Koji,

Strictly speaking about TF-IDF (and BM25, which is an evolution of that approach), I would say it is a weighting function / numerical statistic that can be used by ranking functions. It is based on probabilistic concepts (such as IDF), but it is not a probabilistic function [1]. Indeed, a BM25 score for a term is not guaranteed to satisfy 0 < x < 1.

Furthermore, Lucene and Solr add a lot on top of the BM25 similarity (including different kinds of boost: document, field and query-time boosts, norms, coord), so they use probabilistic concepts but they are not probabilistic search engines.

[1] http://math.stackexchange.com/questions/610165/prove-that-the-bm25-scoring-function-is-probabilistic

-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: http://lucene.472066.n3.nabble.com/Filtering-results-by-minimum-relevancy-score-tp4329180p4329715.html
Sent from the Solr - User mailing list archive at Nabble.com.
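To make the "not guaranteed to satisfy 0 < x < 1" point concrete, here is a minimal standalone computation of the classic BM25 term weight (k1 = 1.2, b = 0.75). This is a textbook sketch, not Lucene's actual implementation, but it shows that a rare term in a large corpus easily scores well above 1:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq,
                    k1=1.2, b=0.75):
    """Classic BM25 weight for a single term in a single document."""
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# A rare term (doc_freq=10 out of 1M docs) pushes the score far above 1.0,
# so a fixed absolute cutoff in (0, 1) has no general meaning.
score = bm25_term_score(tf=5, doc_len=100, avg_doc_len=120,
                        num_docs=1_000_000, doc_freq=10)
print(score)
```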
Re: Filtering results by minimum relevancy score
Hi Walter,

May I ask a tangential question? I'm curious about the following lines you wrote:

> Solr is a vector-space engine. Some early engines (Verity VDK) were
> probabilistic engines. Those do give an absolute estimate of the relevance
> of each hit. Unfortunately, the relevance of results is just not as good as
> vector-space engines. So, probabilistic engines are mostly dead.

Can you elaborate on this? I thought Okapi BM25, which is the default Similarity in Solr, is based on the probabilistic model. Did you mean that Lucene/Solr is still based on the vector space model, but BM25Similarity is built on top of it and therefore is not a pure probabilistic scoring system? Or that Okapi BM25 is not originally probabilistic?

As for me, I prefer the idea of vector space over probabilistic for information retrieval, and I stick with ClassicSimilarity for my projects.

Thanks,

Koji
Re: Filtering results by minimum relevancy score
Thank you! That worked.

From: Ahmet Arslan <iori...@yahoo.com>
Date: Wednesday, April 12, 2017 at 3:15 PM
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>, David Kramer <david.kra...@shoebuy.com>
Subject: Re: Filtering results by minimum relevancy score

Hi,

I cannot find it. However, it should be something like:

q=hello&fq={!frange l=0.5}query($q)

Ahmet
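For reference, a sketch of a full request using the frange filter that worked above. The collection name and threshold are placeholders; the key part is `{!frange l=0.5}query($q)`, which re-evaluates the main query's score per document and keeps only documents scoring at least `l`:

```python
from urllib.parse import urlencode

# Hypothetical collection name ("products") and threshold (0.5).
params = {
    "q": "notebook",
    "fq": "{!frange l=0.5}query($q)",  # drop docs scoring below 0.5
    "fl": "id,score",
    "rows": 10,
}
url = "http://localhost:8983/solr/products/select?" + urlencode(params)
print(url)
```

Note the threshold is in raw score units, which vary per query, so it has to be tuned empirically.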
Re: Filtering results by minimum relevancy score
David,

I think it can be done, but a score has no real *meaning* for your domain other than the one you engineer into it. There's no 1-100 scale that guarantees at 100 your users will love the results. Solr isn't really a turnkey solution. It requires you to understand more deeply what relevance means in your domain and how to use the features of the engine to achieve the right user experience.

What's a relevant result? What does relevant mean for your users? What user experience are you creating? Is this a news search where you need to filter out old articles? Or ones that aren't trustworthy? Or articles where the body doesn't match enough user keywords? Or restaurants outside a certain radius that are not usable?

I've been in a similar situation, and usually getting rid of "low quality" results involves creative uses of filters to remove obvious low-value cases. You can create an fq, for example, that limits the results to only include articles where at least 2 keywords match the body field. Or express some minimum proximity, popularity, or recency requirement.

I think you're going to meet frustration until you can pin down your users and/or your stakeholders on what they want. This is always the hard prob btw ;)
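A sketch of the "at least 2 keywords must match" idea using edismax's minimum-should-match. Field and endpoint names are invented for illustration; `mm=2` drops documents matching fewer than two query terms, which removes many obvious low-value hits without relying on absolute scores:

```python
from urllib.parse import urlencode

# Hypothetical fields (title, body). With defType=edismax, mm=2 requires
# at least two of the query terms to match somewhere in qf.
params = {
    "q": "beach parking permit",
    "defType": "edismax",
    "qf": "title body",
    "mm": "2",
    "rows": 10,
}
query_string = urlencode(params)
print(query_string)
```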
Re: Filtering results by minimum relevancy score
Hi,

I cannot find it. However, it should be something like:

q=hello&fq={!frange l=0.5}query($q)

Ahmet
Re: Filtering results by minimum relevancy score
Fine. It can’t be done. If it was easy, Solr/Lucene would already have the feature, right?

Solr is a vector-space engine. Some early engines (Verity VDK) were probabilistic engines. Those do give an absolute estimate of the relevance of each hit. Unfortunately, the relevance of results is just not as good as vector-space engines. So, probabilistic engines are mostly dead.

But, “you don’t want to do it” is very good advice. Instead of trying to reduce bad hits, work on increasing good hits. It is really hard, sometimes not possible, to optimize both. Increasing the good hits makes your customers happy. Reducing the bad hits makes your UX team happy.

Here is a process. Start collecting the clicks on the search results page (SRP) with each query. Look at queries that have below-average clickthrough. See if those can be combined into categories, then address each category. Some categories that I have used:

* One word or two? “babysitter”, “baby-sitter”, and “baby sitter” are all valid. Use synonyms or shingles (and maybe the word delimiter filter) to match these.
* Misspellings. These should be about 10% of queries. Use fuzzy matching. I recommend the patch in SOLR-629.
* Alternate vocabulary. You sell a “laptop”, but people call it a “notebook”. People search for “kids movies”, but your movie genre is “Children and Family”. Use synonyms.
* Missing content. People can’t find anything about beach parking because there isn’t a page about that. Instead, there are scraps of info about beach parking in multiple other pages. Fix the content.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
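The first three categories Walter lists (word variants, misspellings, alternate vocabulary) are normally handled in Solr analysis chains and fuzzy query syntax; purely as an illustration of the idea, here is a toy query-side rewrite. The synonym table is invented, and `~1` is Lucene's fuzzy-match syntax for one edit of distance:

```python
# Invented synonym table mapping variant phrasings to a canonical token.
SYNONYMS = {
    "baby-sitter": "babysitter",
    "baby sitter": "babysitter",
    "notebook": "laptop",
}

def rewrite_query(raw: str) -> str:
    """Canonicalize variants, then allow one edit of fuzziness on
    single-word queries (Lucene ~1 syntax) to absorb misspellings."""
    q = raw.lower().strip()
    q = SYNONYMS.get(q, q)
    return f"{q}~1" if " " not in q else q

print(rewrite_query("Baby Sitter"))  # babysitter~1
```

In a real deployment this belongs in the schema (SynonymGraphFilter, word delimiter, shingles) rather than client code.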
Re: Filtering results by minimum relevancy score
Hi David,

A function query named "query" returns the score for the given subquery. Combined with the frange query parser this is possible. I tried it in the past. I am searching for the original post; I think it was Yonik's.

https://cwiki.apache.org/confluence/display/solr/Function+Queries

Ahmet
Re: Filtering results by minimum relevancy score
The idea is to not return poorly matching results, not to limit the number of results returned. One query may have hundreds of excellent matches and another query may have 7, so cutting off by the number of results is trivial but not useful.

Again, we are not doing this for performance reasons. We’re doing this because we don’t want to show products that are not very relevant to the search terms specified by the user, for UX reasons.

I had hoped that the responses would have been more focused on “it can’t be done” or “here’s how to do it” than “you don’t want to do it”. I’m still left not knowing if it’s even possible. The one concrete answer of using frange doesn’t help, as referencing score in either the q or the fq produces an “undefined field” error.

Thanks.
Re: Filtering results by minimum relevancy score
Well, just because ES has it doesn't mean it's A Good Thing. IMO, it's just a "feel good" kind of thing for people who don't really understand scoring.

From that page: "Note, most times, this does not make much sense, but is provided for advanced use cases."

I've written enough weasel-worded caveats to read the hidden message here (freely translated and purged of expletives): "OK, if you insist we'll provide this, and we'll make you feel good by saying it's for 'advanced use cases'. We don't expect this to be useful at all, but it's easy to do and we'll waste more time arguing than just putting it in. P.S. Don't call us when you find out this is useless."

Best,
Erick
Re: Filtering results by minimum relevancy score
On 4/10/2017 8:59 AM, David Kramer wrote:
> I’ve done quite a bit of searching on this. Pretty much every page I
> find says it’s a bad idea and won’t work well, but I’ve been asked to
> at least try it to reduce the number of completely unrelated results
> returned. We are not trying to normalize the number, or display it as
> a percentage, and I understand why those are not mathematically sound.
> We are relying on Solr for pagination, so we can’t just filter out low
> scores from the results.

Here's my contribution. This boils down to nearly the same thing Erick said, but stated in a very different way: the absolute score value has zero meaning, for ANY purpose ... not just percentages or normalization. If you try to use it, you're asking for disappointment.

Scores only have meaning within a single query, and the only information that's important is whether the score of one document is higher or lower than the scores of the rest of the documents in the same result. Boosting lets you influence those relative scores, but the actual numeric score of one document in a result doesn't reveal ANYTHING useful about that document.

I agree with Erick's general advice: instead of trying to arbitrarily decide which documents are scoring too low to be relevant, refine the query so that irrelevant results are either completely excluded, or so relevant documents will outscore irrelevant ones and the first few pages will be good results. Users must be trained to expect irrelevant (and slow) results if they paginate deeply. For performance reasons, you should limit how many pages users can view on a result.

Thanks,
Shawn
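Shawn's point that scores only compare within a single query can be shown with a toy IDF-style scorer: the same scoring function produces very different absolute magnitudes for different queries over the same corpus, so an absolute cutoff tuned for one query is meaningless for another. This is a deliberately simplified sketch, not Lucene's scoring:

```python
import math

docs = [
    "solr relevance scoring guide",
    "cooking pasta at home",
    "solr sharding and pagination",
]

def idf(term):
    df = sum(term in d.split() for d in docs)
    return math.log(len(docs) / df) if df else 0.0

def score(query, doc):
    """Sum of IDF weights of the query terms present in the doc."""
    return sum(idf(t) for t in query.split() if t in doc.split())

rare = score("pasta", docs[1])    # rare term -> high absolute score
common = score("solr", docs[0])   # common term -> low absolute score
print(rare, common)               # both are "perfect" top hits
```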
Re: Filtering results by minimum relevancy score
@alessandro Elasticsearch has it:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-min-score.html
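For comparison, the Elasticsearch feature Dorian links to is a top-level `min_score` parameter in the search request body: hits scoring below the threshold are excluded. A minimal sketch of such a request body (index and field names are hypothetical):

```python
import json

# ES search body with min_score; documents scoring below 0.5 are dropped
# from the hits. Solr has no direct equivalent, hence the frange workaround.
body = {
    "min_score": 0.5,
    "query": {"match": {"title": "notebook"}},
}
payload = json.dumps(body)
print(payload)
```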
Re: Filtering results by minimum relevancy score
I am not completely sure that the potential benefit of merging fewer docs in sharded pagination overcomes the additional time needed to apply the filtering function query. I would need to investigate the frange internals in more detail.

Cheers

-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
Re: Filtering results by minimum relevancy score
Can't the filter be used when you're paginating in a sharded scenario? So if you do limit=10, offset=10, each shard will return 20 docs? While if you do limit=10, _score<=last_page.min_score, then each shard will return 10 docs? (They will still score all docs, but merging will be faster.)

Makes sense?

On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti <a.benede...@sease.io> wrote:
> Can I ask what the final requirement is here? What are you trying to do?
> - Just display fewer results? You can easily do that at search-client
>   time, cutting after a certain amount.
> - Make search faster by returning fewer results? This is not going to
>   work, as you need to score all of them, as Erick explained.
>
> Function queries (as Mikhail specified) will run on a per-document basis
> (if I am correct), so if your idea was to speed things up, this is not
> going to work.
>
> It makes much more sense to refine your system to improve relevancy if
> your concern is to have more relevant docs. If your concern is just to
> not show that many pages, you can limit that client side.
>
> -----
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
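The arithmetic behind the sharded-paging point above can be sketched as a toy model (this is not Solr code, just the counting argument):

```python
def docs_per_shard_with_offset(limit: int, offset: int) -> int:
    # Classic distributed paging: each shard must return its own top
    # (offset + limit) docs so the coordinator can merge correctly.
    return offset + limit

def docs_per_shard_with_cutoff(limit: int) -> int:
    # If the client instead carries the previous page's minimum score as
    # a ceiling, each shard only ships its top `limit` docs under that
    # ceiling. Each shard still scores all matching docs either way.
    return limit

# Page 2 of 10 results:
with_offset = docs_per_shard_with_offset(10, 10)   # 20 docs per shard
with_cutoff = docs_per_shard_with_cutoff(10)       # 10 docs per shard
```

So the saving is only in the merge step (fewer docs shipped and merged per shard), not in scoring, which matches Erick's caveat elsewhere in the thread.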
Re: Filtering results by minimum relevancy score
Can I ask what the final requirement is here? What are you trying to do?

- Just display fewer results? You can easily do that at search-client time, cutting after a certain amount.
- Make search faster by returning fewer results? This is not going to work, as you need to score all of them, as Erick explained.

Function queries (as Mikhail specified) will run on a per-document basis (if I am correct), so if your idea was to speed things up, this is not going to work.

It makes much more sense to refine your system to improve relevancy if your concern is to have more relevant docs. If your concern is just to not show that many pages, you can limit that client side.

-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
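A client-side cutoff like the one described above can be as simple as dropping everything below some fraction of the page's top score. A rough sketch follows — the 0.2 fraction is an arbitrary illustration, and the doc shape assumes a Solr JSON response requested with fl=*,score:

```python
def trim_low_scores(docs, min_fraction=0.2):
    """Keep only docs scoring at least min_fraction of the top score.

    `docs` is the list from a Solr JSON response
    (response['response']['docs']), which is ordered by score descending
    when no explicit sort is given. The 0.2 default is an arbitrary
    illustration, not a recommendation.
    """
    if not docs:
        return []
    top = docs[0]["score"]
    return [d for d in docs if d["score"] >= top * min_fraction]

results = [
    {"id": "a", "score": 5.0},
    {"id": "b", "score": 2.0},
    {"id": "c", "score": 0.4},   # below 0.2 * 5.0, gets dropped
]
kept = trim_low_scores(results)
```

Note this only trims within a single returned page for display purposes; it does nothing for pagination totals or performance.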
Re: Filtering results by minimum relevancy score
Here we go: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FunctionRangeQueryParser

But it's 100% YAGNI. You'd better tweak the search to be more precise.

On Mon, Apr 10, 2017 at 7:12 PM, Ahmet Arslan wrote:
> Hi,
>
> I remember that this is possible via the frange query parser, but I don't
> have the query string at hand.
>
> Ahmet
>
> On Monday, April 10, 2017, 9:00:09 PM GMT+3, David Kramer
> <david.kra...@shoebuy.com> wrote:
>
> I’ve done quite a bit of searching on this. Pretty much every page I find
> says it’s a bad idea and won’t work well, but I’ve been asked to at least
> try it to reduce the number of completely unrelated results returned. We
> are not trying to normalize the number, or display it as a percentage,
> and I understand why those are not mathematically sound. We are relying
> on Solr for pagination, so we can’t just filter out low scores from the
> results.
>
> I had assumed that you could use score in the filter query, but that
> doesn’t appear to be the case. Is there a special way to reference it, or
> is there another way to attack the problem? It seems like something that
> should be allowed and possible.
>
> Thanks.

--
Sincerely yours
Mikhail Khludnev
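For the archives: the frange approach from the link above wraps the main query in a function range filter, `{!frange l=X}query($q)`, which keeps documents whose score for the main query is at least X. A hedged sketch of building such request parameters — the query string and the 0.5 cutoff are invented for illustration, and remember the thread's caveat that absolute scores are not comparable across queries:

```python
from urllib.parse import urlencode

def score_filter_params(q: str, min_score: float) -> str:
    """Build Solr request params that filter on the main query's score.

    {!frange l=X}query($q) re-evaluates the main query per document and
    keeps only docs scoring >= X (the lower bound `l`).
    """
    params = {
        "q": q,
        "fq": f"{{!frange l={min_score}}}query($q)",
        "fl": "id,score",
    }
    return urlencode(params)

qs = score_filter_params("title:shoes", 0.5)
```

As the thread points out, this still scores every matching document, so it trims what is returned rather than speeding anything up.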
Re: Filtering results by minimum relevancy score
Hi,

I remember that this is possible via the frange query parser, but I don't have the query string at hand.

Ahmet

On Monday, April 10, 2017, 9:00:09 PM GMT+3, David Kramer wrote:

I’ve done quite a bit of searching on this. Pretty much every page I find says it’s a bad idea and won’t work well, but I’ve been asked to at least try it to reduce the number of completely unrelated results returned. We are not trying to normalize the number, or display it as a percentage, and I understand why those are not mathematically sound. We are relying on Solr for pagination, so we can’t just filter out low scores from the results.

I had assumed that you could use score in the filter query, but that doesn’t appear to be the case. Is there a special way to reference it, or is there another way to attack the problem? It seems like something that should be allowed and possible.

Thanks.
Re: Filtering results by minimum relevancy score
Well, that's rather the point: the low-scoring docs aren't unrelated, someone just thinks they are.

Flippancy aside, the score is, as you've researched, a bad gauge. Lucene computes scores doc by doc as it collects results, so at any point in the collection process you may get a doc that scores 10x the previous top score, or 1/10 of the previous low score. Point being that until the complete list is assembled, you really can't say much about any particular document. I think it's just a bad idea to try to use _score_ for this. Rather, refine how you query to reduce the number of unrelated documents. Of course then someone will complain that "there are docs I know should be returned that aren't."

You mentioned trying to use the score in a filter query. How would that work? You don't know on the way in whether the top-scoring doc will score 100 or 1. Even a normalized score can't be computed until you know the min/max, which you don't know until the last doc is scored.

This is the inescapable tension between precision and recall. In essence, you're being asked to increase precision at the expense of recall (i.e. return fewer documents that are "more relevant"). The best way to do that is to refine the query. Of course, one option is to just count on people getting tired of paging.

Best,
er...@notverymuchhelp.com

On Mon, Apr 10, 2017 at 7:59 AM, David Kramer wrote:
> I’ve done quite a bit of searching on this. Pretty much every page I find
> says it’s a bad idea and won’t work well, but I’ve been asked to at least
> try it to reduce the number of completely unrelated results returned. We
> are not trying to normalize the number, or display it as a percentage, and
> I understand why those are not mathematically sound. We are relying on
> Solr for pagination, so we can’t just filter out low scores from the
> results.
>
> I had assumed that you could use score in the filter query, but that
> doesn’t appear to be the case. Is there a special way to reference it, or
> is there another way to attack the problem? It seems like something that
> should be allowed and possible.
>
> Thanks.
Filtering results by minimum relevancy score
I’ve done quite a bit of searching on this. Pretty much every page I find says it’s a bad idea and won’t work well, but I’ve been asked to at least try it to reduce the number of completely unrelated results returned. We are not trying to normalize the number, or display it as a percentage, and I understand why those are not mathematically sound. We are relying on Solr for pagination, so we can’t just filter out low scores from the results.

I had assumed that you could use score in the filter query, but that doesn’t appear to be the case. Is there a special way to reference it, or is there another way to attack the problem? It seems like something that should be allowed and possible.

Thanks.