Re: Filtering results by minimum relevancy score

2017-04-13 Thread Walter Underwood



Re: Filtering results by minimum relevancy score

2017-04-13 Thread alessandro.benedetti
Hi Koji,
strictly talking about TF-IDF (and BM25, which is an evolution of that
approach), I would say it is a weighting function/numerical statistic that
can be used in ranking functions. It is based on probabilistic concepts
(such as IDF), but it is not a probabilistic function [1].
Indeed, a BM25 score for a term is not guaranteed to satisfy 0 < x < 1.
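
To make that concrete, here is a small numeric sketch (the textbook BM25 formula, not Lucene's exact implementation; the parameters k1 = 1.2, b = 0.75 and the corpus statistics are illustrative) showing that a BM25 term score easily exceeds 1:

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 score for a single term (illustrative, not Lucene's code)."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)  # BM25-style IDF
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# A moderately rare term (10 of 1000 docs) occurring 3 times in an average-length doc:
score = bm25_term_score(tf=3, df=10, n_docs=1000, doc_len=100, avg_doc_len=100)
print(round(score, 2))  # well above 1, so the score is clearly not a probability
```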

Furthermore, Lucene and Solr add a lot on top of the BM25 similarity
(including different kinds of boosts: document, field, and query-time boosts,
norms, coord), so they use probabilistic concepts but they are not
probabilistic search engines.

[1]
http://math.stackexchange.com/questions/610165/prove-that-the-bm25-scoring-function-is-probabilistic



-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Filtering-results-by-minimum-relevancy-score-tp4329180p4329715.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filtering results by minimum relevancy score

2017-04-12 Thread Koji Sekiguchi

Hi Walter,

May I ask a tangential question? I'm curious about the following line you wrote:

> Solr is a vector-space engine. Some early engines (Verity VDK) were probabilistic engines. Those 
do give an absolute estimate of the relevance of each hit. Unfortunately, the relevance of results 
is just not as good as vector-space engines. So, probabilistic engines are mostly dead.


Can you elaborate on this?

I thought Okapi BM25, which is the default Similarity in Solr, is based on the
probabilistic model. Did you mean that Lucene/Solr is still based on the vector
space model, but they built BM25Similarity on top of it, and therefore
BM25Similarity is not a pure probabilistic scoring system? Or that Okapi BM25
is not originally probabilistic?

As for me, I prefer the idea of the vector space model over the probabilistic
one for information retrieval, and I stick with ClassicSimilarity for my
projects.

Thanks,

Koji






Re: Filtering results by minimum relevancy score

2017-04-12 Thread David Kramer
Thank you!  That worked.


From: Ahmet Arslan <iori...@yahoo.com>
Date: Wednesday, April 12, 2017 at 3:15 PM
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>, David Kramer 
<david.kra...@shoebuy.com>
Subject: Re: Filtering results by minimum relevancy score

Hi,

I cannot find it. However, it should be something like

q=hello&fq={!frange l=0.5}query($q)

Ahmet
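
For reference, the suggestion above is two separate parameters; a minimal sketch of how the request might be assembled (the host, port, and collection name here are made up):

```python
from urllib.parse import urlencode

# The score cutoff goes in an fq using the frange parser over the query()
# function, which re-evaluates the main query $q and yields its score per doc.
params = {
    "q": "hello",
    "fq": "{!frange l=0.5}query($q)",  # keep only docs scoring >= 0.5 for $q
    "rows": 10,
}
url = "http://localhost:8983/solr/mycollection/select?" + urlencode(params)
print(url)
```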


Re: Filtering results by minimum relevancy score

2017-04-12 Thread Doug Turnbull
David, I think it can be done, but a score has no real *meaning* to your
domain other than the one you engineer into it. There's no 1-100 scale that
guarantees at 100 that your users will love the results.

Solr isn't really a turn-key solution. It requires you to understand more
deeply what relevance means in your domain and how to use the features of
the engine to achieve the right user experience.

What's a relevant result? What does Relevant mean for your users? What user
experience are you creating?

Is this a news search where you need to filter out old articles? Or ones
that aren't trustworthy? Or articles where the body doesn't match enough
user keywords? Or restaurants outside a certain radius as not usable?


I've been in similar situation and usually getting rid of "low quality"
results involves creative uses of filters to remove obvious low-value
cases. You can create an fq for example that limits the results to only
include articles where at least 2 keywords match the body field. Or express
some minimum proximity, popularity, or recency requirement.
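
As a sketch of the "at least 2 keywords must match" idea, one option is the mm (minimum-should-match) parameter of the edismax parser rather than an fq (the field name `body` here is hypothetical):

```python
from urllib.parse import urlencode

# Require at least 2 of the user's terms to match, instead of filtering
# on the raw score, which has no absolute meaning.
params = {
    "q": "beach parking permit",
    "defType": "edismax",
    "qf": "body",   # hypothetical field name
    "mm": "2",      # minimum-should-match: at least 2 query terms must match
    "rows": 10,
}
query_string = urlencode(params)
print(query_string)
```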

I think you're going to meet frustration until you can pin down your users
and/or your stakeholders on what they want. This is always the hard problem,
btw ;)



Re: Filtering results by minimum relevancy score

2017-04-12 Thread Ahmet Arslan
Hi,
I cannot find it. However, it should be something like
q=hello&fq={!frange l=0.5}query($q)

Ahmet

Re: Filtering results by minimum relevancy score

2017-04-12 Thread Walter Underwood
Fine. It can’t be done. If it was easy, Solr/Lucene would already have the 
feature, right?

Solr is a vector-space engine. Some early engines (Verity VDK) were 
probabilistic engines. Those do give an absolute estimate of the relevance of 
each hit. Unfortunately, the relevance of results is just not as good as 
vector-space engines. So, probabilistic engines are mostly dead.

But, “you don’t want to do it” is very good advice. Instead of trying to reduce 
bad hits, work on increasing good hits. It is really hard, sometimes not 
possible, to optimize both. Increasing the good hits makes your customers 
happy. Reducing the bad hits makes your UX team happy.

Here is a process. Start collecting the clicks on the search results page (SRP) 
with each query. Look at queries that have below average clickthrough. See if 
those can be combined into categories, then address each category.

Some categories that I have used:

* One word or two? “babysitter”, “baby-sitter”, and “baby sitter” are all 
valid. Use synonyms or shingles (and maybe the word delimiter filter) to match 
these.

* Misspellings. These should be about 10% of queries. Use fuzzy matching. I 
recommend the patch in SOLR-629.

* Alternate vocabulary. You sell a “laptop”, but people call it a “notebook”. 
People search for “kids movies”, but your movie genre is “Children and Family”. 
Use synonyms.

* Missing content. People can’t find anything about beach parking because there 
isn’t a page about that. Instead, there are scraps of info about beach parking 
in multiple other pages. Fix the content.
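
For the one-word-or-two and alternate-vocabulary cases, a minimal sketch of what the synonyms file might contain (the mappings are illustrative, not from this thread; the file is wired into the analyzer via SynonymFilter in the schema):

```
# synonyms.txt (illustrative)
babysitter, baby-sitter, baby sitter
laptop, notebook
kids movies => children and family
```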

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)





Re: Filtering results by minimum relevancy score

2017-04-12 Thread Ahmet Arslan
Hi David,
A function query named "query" returns the score for the given subquery. 
Combined with the frange query parser, this is possible. I tried it in the past.
I am searching for the original post. I think it was Yonik's post.
https://cwiki.apache.org/confluence/display/solr/Function+Queries


Ahmet




Re: Filtering results by minimum relevancy score

2017-04-12 Thread David Kramer
The idea is to not return poorly matching results, not to limit the number of 
results returned.  One query may have hundreds of excellent matches and another 
query may have 7. So cutting off by the number of results is trivial but not 
useful.

Again, we are not doing this for performance reasons. We’re doing this because 
we don’t want to show products that are not very relevant to the search terms 
specified by the user for UX reasons.

I had hoped that the responses would have been more focused on “it can’t be
done” or “here’s how to do it” than “you don’t want to do it”.   I’m still left
not knowing if it’s even possible. The one concrete answer of using frange 
doesn’t help as referencing score in either the q or the fq produces an 
“undefined field” error.

Thanks.

On 4/11/17, 8:59 AM, "Dorian Hoxha" <dorian.ho...@gmail.com> wrote:

Can't the filter be used in cases when you're paginating in
sharded-scenario ?
So if you do limit=10, offset=10, each shard will return 20 docs ?
While if you do limit=10, _score<=last_page.min_score, then each shard will
return 10 docs ? (they will still score all docs, but merging will be
faster)

Makes sense ?

On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti <a.benede...@sease.io
> wrote:

> Can i ask what is the final requirement here ?
> What are you trying to do ?
>  - just display less results ?
> you can easily do at search client time, cutting after a certain amount
> - make search faster returning less results ?
> This is not going to work, as you need to score all of them as Erick
> explained.
>
> Function query ( as Mikhail specified) will run on a per document basis (
> if
> I am correct), so if your idea was to speed up the things, this is not
> going
> to work.
>
> It makes much more sense to refine your system to improve relevancy if 
your
> concern is to have more relevant docs.
> If your concern is just to not show that many pages, you can limit that
> client side.
>
>
>
>
>
>
> -----
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Filtering-results-by-minimum-relevancy-score-
> tp4329180p4329295.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>




Re: Filtering results by minimum relevancy score

2017-04-12 Thread Erick Erickson
Well, just because ES has it doesn't mean it's A Good Thing. IMO, it's
just a "feel good" kind of thing for people who don't really
understand scoring.

From that page: "Note, most times, this does not make much sense, but
is provided for advanced use cases."

I've written enough weasel-worded caveats to read the hidden message
here (freely translated and purged of expletives):

"OK, if you insist we'll provide this, and we'll make you feel good by
saying it's for 'advanced use cases'. We don't expect this to be
useful at all, but it's easy to do and we'll waste more time arguing
than just putting it in. P.S. don't call us when you find out this is
useless".

Best,
Erick



Re: Filtering results by minimum relevancy score

2017-04-12 Thread Shawn Heisey
On 4/10/2017 8:59 AM, David Kramer wrote:
> I’ve done quite a bit of searching on this. Pretty much every page I
> find says it’s a bad idea and won’t work well, but I’ve been asked to
> at least try it to reduce the number of completely unrelated results
> returned. We are not trying to normalize the number, or display it as
> a percentage, and I understand why those are not mathematically sound.
> We are relying on Solr for pagination, so we can’t just filter out low
> scores from the results. 

Here's my contribution.  This boils down to nearly the same thing Erick
said, but stated in a very different way: The absolute score value has
zero meaning, for ANY purpose ... not just percentages or
normalization.  If you try to use it, you're asking for disappointment.

Scores only have meaning within a single query, and the only information
that's important is whether the score of one document is higher or lower
than the score of the rest of the documents in the same result. 
Boosting lets you influence those relative scores, but the actual
numeric score of one document in a result doesn't reveal ANYTHING useful
about that document.

I agree with Erick's general advice:  Instead of trying to arbitrarily
decide which documents are scoring too low to be relevant, refine the
query so that irrelevant results are either completely excluded, or so
relevant documents will outscore irrelevant ones and the first few pages
will be good results.  Users must be trained to expect irrelevant (and
slow) results if they paginate deeply.  For performance reasons, you
should limit how many pages users can view on a result.

Thanks,
Shawn



Re: Filtering results by minimum relevancy score

2017-04-12 Thread Dorian Hoxha
@alessandro
Elastic-search has it:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-min-score.html
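
For comparison, a sketch of how that Elasticsearch option is used (the index name and query are made up; the top-level min_score drops hits whose _score is below the cutoff):

```python
import json

# Hypothetical request body for POST /my_index/_search
body = {
    "min_score": 0.5,                        # drop hits with _score < 0.5
    "query": {"match": {"title": "hello"}},
}
print(json.dumps(body))
```

Note that the same caveat from this thread applies: the absolute score has no stable meaning, so any fixed cutoff is fragile across queries.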

On Wed, Apr 12, 2017 at 1:49 PM, alessandro.benedetti <a.benede...@sease.io>
wrote:

> I am not completely sure that the potential benefit of merging fewer docs in
> sharded pagination overcomes the additional time needed to apply the
> filtering function query.
> I would need to investigate the frange internals in more detail.
>
> Cheers
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Filtering-results-by-minimum-relevancy-score-tp4329180p4329489.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Filtering results by minimum relevancy score

2017-04-12 Thread alessandro.benedetti
I am not completely sure that the potential benefit of merging fewer docs in
sharded pagination overcomes the additional time needed to apply the
filtering function query.
I would need to investigate the frange internals in more detail.

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: http://lucene.472066.n3.nabble.com/Filtering-results-by-minimum-relevancy-score-tp4329180p4329489.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filtering results by minimum relevancy score

2017-04-11 Thread Dorian Hoxha
Can't the filter be used when you're paginating in a sharded scenario?
So if you do limit=10, offset=10, each shard will return 20 docs?
While if you do limit=10, _score<=last_page.min_score, then each shard will
return 10 docs? (They will still score all docs, but merging will be
faster.)

Makes sense?
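Dorian's shard arithmetic can be sketched directly. With classic deep paging, every shard must return its own top offset+limit docs, since any of them could land in the global page; with a score cutoff carried over from the previous page, each shard only needs its top limit docs. A toy model, pure arithmetic with no Solr calls (the shard count below is an arbitrary assumption):

```python
def shard_fetch_deep_paging(limit: int, offset: int, n_shards: int):
    """Classic distributed paging: every shard returns its own top
    (offset + limit) docs, since any of them could belong to the
    global page. Returns (docs per shard, docs merged at coordinator)."""
    per_shard = offset + limit
    return per_shard, per_shard * n_shards

def shard_fetch_score_cursor(limit: int, n_shards: int):
    """Paging on a score cutoff from the previous page: each shard only
    returns its top `limit` docs scoring below the cutoff.
    All docs are still scored; only the merge gets cheaper."""
    return limit, limit * n_shards
```

For page 2 at 10 results per page across 4 shards, deep paging merges 80 candidate docs at the coordinator versus 40 with a score cursor.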

On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti <a.benede...@sease.io
> wrote:

> Can I ask what the final requirement is here?
> What are you trying to do?
>  - just display fewer results?
> You can easily do that at search-client time, cutting after a certain amount.
> - make search faster by returning fewer results?
> This is not going to work, as you need to score all of them, as Erick
> explained.
>
> Function queries (as Mikhail specified) will run on a per-document basis (if
> I am correct), so if your idea was to speed things up, this is not going
> to work.
>
> It makes much more sense to refine your system to improve relevancy if your
> concern is to have more relevant docs.
> If your concern is just to not show that many pages, you can limit that
> client-side.
>
>
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Filtering-results-by-minimum-relevancy-score-tp4329180p4329295.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Filtering results by minimum relevancy score

2017-04-11 Thread alessandro.benedetti
Can I ask what the final requirement is here?
What are you trying to do?
 - just display fewer results?
You can easily do that at search-client time, cutting after a certain amount.
- make search faster by returning fewer results?
This is not going to work, as you need to score all of them, as Erick
explained.

Function queries (as Mikhail specified) will run on a per-document basis (if
I am correct), so if your idea was to speed things up, this is not going
to work.

It makes much more sense to refine your system to improve relevancy if your
concern is to have more relevant docs.
If your concern is just to not show that many pages, you can limit that
client-side.
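The client-side cutting Alessandro mentions could look like the sketch below, assuming the standard Solr JSON response shape and a request that included `fl=*,score`; the 0.5 threshold and the sample values are arbitrary assumptions:

```python
def trim_by_score(response: dict, min_score: float) -> list:
    """Drop docs below min_score from a standard Solr JSON response.
    Assumes the request asked for the score field (fl=*,score)."""
    docs = response["response"]["docs"]
    return [d for d in docs if d.get("score", 0.0) >= min_score]

# Toy response in Solr's JSON shape (values are made up):
resp = {"response": {"numFound": 2, "docs": [
    {"id": "1", "score": 2.3},
    {"id": "2", "score": 0.4},
]}}
kept = trim_by_score(resp, 0.5)   # keeps only doc "1"
```

Note this only trims what is displayed; Solr still scores and returns the full page, and `numFound` still reflects the untrimmed count.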






-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: http://lucene.472066.n3.nabble.com/Filtering-results-by-minimum-relevancy-score-tp4329180p4329295.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filtering results by minimum relevancy score

2017-04-10 Thread Mikhail Khludnev
Here we go
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FunctionRangeQueryParser
.
But it's 100% YAGNI. You'd better tweak search to be more precise.
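For readers following the link, a score-cutoff filter with the function range parser might look like the request below. This is a sketch: the query text and the 0.8 lower bound are illustrative assumptions, and, as the thread stresses, every document still gets scored before the filter applies.

```python
# Sketch of a {!frange} score-cutoff request, per the parser Mikhail links to.
# The query text and the 0.8 lower bound are illustrative assumptions.
params = {
    "q": "title:solr",
    # Re-evaluate the main query as a function and keep only docs whose
    # score is >= 0.8 ('l' is the lower bound of the function range).
    "fq": "{!frange l=0.8}query($q)",
    "fl": "id,score",
    "rows": 10,
}
```

These params would be sent to `/select` as ordinary request parameters; the `query($q)` function re-runs the main query per document, which is why this does not make the search cheaper.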

On Mon, Apr 10, 2017 at 7:12 PM, Ahmet Arslan wrote:

> Hi,
> I remember that this is possible via frange query parser.But I don't have
> the query string at hand.
> Ahmet
> On Monday, April 10, 2017, 9:00:09 PM GMT+3, David Kramer <
> david.kra...@shoebuy.com> wrote:
> I’ve done quite a bit of searching on this.  Pretty much every page I find
> says it’s a bad idea and won’t work well, but I’ve been asked to at least
> try it to reduce the number of completely unrelated results returned.  We
> are not trying to normalize the number, or display it as a percentage, and
> I understand why those are not mathematically sound.  We are relying on
> Solr for pagination, so we can’t just filter out low scores from the
> results.
>
> I had assumed that you could use score in the filter query, but that
> doesn’t appear to be the case.  Is there a special way to reference it, or
> is there another way to attack the problem?  It seems like something that
> should be allowed and possible.
>
> Thanks.
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Filtering results by minimum relevancy score

2017-04-10 Thread Ahmet Arslan
Hi,
I remember that this is possible via the frange query parser. But I don't have
the query string at hand.
Ahmet
On Monday, April 10, 2017, 9:00:09 PM GMT+3, David Kramer wrote:
I’ve done quite a bit of searching on this.  Pretty much every page I find says 
it’s a bad idea and won’t work well, but I’ve been asked to at least try it to 
reduce the number of completely unrelated results returned.  We are not trying 
to normalize the number, or display it as a percentage, and I understand why 
those are not mathematically sound.  We are relying on Solr for pagination, so 
we can’t just filter out low scores from the results.

I had assumed that you could use score in the filter query, but that doesn’t 
appear to be the case.  Is there a special way to reference it, or is there 
another way to attack the problem?  It seems like something that should be 
allowed and possible.

Thanks.

Re: Filtering results by minimum relevancy score

2017-04-10 Thread Erick Erickson
Well, that's rather the point: the low-scoring docs aren't unrelated;
someone just thinks they are.

Flippancy aside, the score is, as you've researched, a bad gauge.
Since Lucene has to compute each document's score before it knows how the
rest will score, at any point in the collection process you may get a doc
that's 10x the previous top score, or 1/10x the previous low score.

Point being that until the complete list is assembled, you really
can't say much about any particular document.

I think it's just a bad idea to try to use _score_ for this. Rather,
refine how you query to reduce the numbers of unrelated documents.

Of course then someone will complain that "there are docs I know that
should be returned that aren't."

You mentioned trying to use the score in a filter query. How would
that work? You don't know on the way in whether the top scoring doc
will be 100 or 1. Even a normalized score can't be computed until you
know the min/max, which you don't know until the last doc is scored.

This is the inescapable tension between precision and recall. In
essence, you're being asked to increase precision at the expense of
recall (i.e. return fewer documents that are "more relevant"). The
best way to do that is refine the query.
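Erick's normalization point can be seen in a few lines: a min-max normalized score is a function of the whole result list, so it cannot be computed while collection is still in progress. A sketch:

```python
def normalize(scores: list) -> list:
    """Min-max normalize scores to [0, 1]; requires the COMPLETE list."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)   # degenerate case: all scores equal
    return [(s - lo) / (hi - lo) for s in scores]
```

The normalized value of the first-collected doc changes the moment a higher-scoring doc arrives later in collection, so no early score-based cutoff is possible.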

Of course one option is to just count on people getting tired of paging.

Best,
er...@notverymuchhelp.com

On Mon, Apr 10, 2017 at 7:59 AM, David Kramer  wrote:
> I’ve done quite a bit of searching on this.  Pretty much every page I find 
> says it’s a bad idea and won’t work well, but I’ve been asked to at least try 
> it to reduce the number of completely unrelated results returned.  We are not 
> trying to normalize the number, or display it as a percentage, and I 
> understand why those are not mathematically sound.  We are relying on Solr 
> for pagination, so we can’t just filter out low scores from the results.
>
> I had assumed that you could use score in the filter query, but that doesn’t 
> appear to be the case.  Is there a special way to reference it, or is there 
> another way to attack the problem?  It seems like something that should be 
> allowed and possible.
>
> Thanks.


Filtering results by minimum relevancy score

2017-04-10 Thread David Kramer
I’ve done quite a bit of searching on this.  Pretty much every page I find says 
it’s a bad idea and won’t work well, but I’ve been asked to at least try it to 
reduce the number of completely unrelated results returned.  We are not trying 
to normalize the number, or display it as a percentage, and I understand why 
those are not mathematically sound.  We are relying on Solr for pagination, so 
we can’t just filter out low scores from the results.

I had assumed that you could use score in the filter query, but that doesn’t 
appear to be the case.  Is there a special way to reference it, or is there 
another way to attack the problem?  It seems like something that should be 
allowed and possible.

Thanks.