Re: SOLR Score Range Changed

2018-02-26 Thread Shawn Heisey
On 2/23/2018 2:28 PM, Hodder, Rick wrote:
> Combining everything into one query is what I'd prefer because as you said, 
> one would think that with everything in the same query, the score would 
> organize everything nicely.

I don't recall writing anything like that.  How did you infer that from
what I wrote?  One thing that you can infer from what I said is that
comparing scores from multiple queries is not going to do what you think
it will do.  Which leads into the next thing I'll quote from your message:

> So the way we had addressed it was running 3 separate SOLR queries and 
> combining them and sorting them by descending score - wasn’t perfect, but it 
> worked, and helped me to reduce the number of results we hand off to a 
> scoring engine that applies 3 algorithms (Monge-Elkan, Jaro-Winkler, and 
> SmithWindowed Affine) to further hone the results - which can take LOTS of 
> time if there are a lot of results, so 

It seems that you didn't finish your sentence, and may not have even
finished the message, as this was the last thing you wrote.

Running three separate queries and then trying to combine them based on
score is not something you should ever attempt, because as I mentioned
before, the absolute score of a document in a result is only meaningful
for that specific query done at that moment.  Even the same query done
later after something has changed might have a very different score range.
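
One way to combine several result lists without comparing scores across
queries is to interleave them by rank. The sketch below is only an
illustration of that idea, not something proposed in this thread: the function
name and the document ids are hypothetical, and it assumes each list is
already in Solr's relevancy order for its own query.

```python
from itertools import zip_longest


def interleave_by_rank(result_lists, limit):
    """Merge per-query result lists round-robin by rank, ignoring raw scores.

    result_lists: lists of document ids, each sorted by relevancy for its
    own query. Duplicates are kept only at their first (best) position.
    """
    if limit <= 0:
        return []
    merged, seen = [], set()
    # Each "tier" holds the rank-n hit of every query (None where exhausted).
    for tier in zip_longest(*result_lists):
        for doc_id in tier:
            if doc_id is None or doc_id in seen:
                continue
            seen.add(doc_id)
            merged.append(doc_id)
            if len(merged) == limit:
                return merged
    return merged


# Hypothetical document ids, for illustration only.
bobs = ["b1", "b2", "b3"]
consolidated = ["c1", "c2", "c3", "c4"]
land = ["l1", "b2"]
top = interleave_by_rank([bobs, consolidated, land], limit=6)
print(top)
```

Because only the position within each list is used, every query contributes
its best hits regardless of how high or low its score range happens to be.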

Thanks,
Shawn



RE: SOLR Score Range Changed

2018-02-23 Thread Hodder, Rick
ClassicSimilarity helped, but the score ranges still don't have a minimum near 
0 as they did in version 4.



Are there other attributes/elements to this factory that could get me back the 
old functionality?

-Original Message-
From: Joël Trigalo [mailto:jtrig...@gmail.com] 
Sent: Friday, February 23, 2018 10:41 AM
To: solr-user@lucene.apache.org
Subject: Re: SOLR Score Range Changed

The difference seems to be due to the fact that the default similarity in Solr 
7 is BM25, while it used to be TF-IDF in Solr 4. As you realised, the BM25 
function is smoother.
You can configure schema.xml to use ClassicSimilarity, for instance:
https://lucene.apache.org/solr/guide/6_6/major-changes-from-solr-5-to-solr-6.html#default-similarity-changes
https://lucene.apache.org/solr/guide/6_6/field-type-definitions-and-properties.html#FieldTypeDefinitionsandProperties-FieldTypeSimilarity

But as said before, you may be relying on properties that are not guaranteed, 
so it would be better to change the score function or the sorting (rather than 
going back to ClassicSimilarity).



RE: SOLR Score Range Changed

2018-02-23 Thread Hodder, Rick
Hi Shawn,

Thanks for your help - I'm still finding my way in the weeds of SOLR.

Combining everything into one query is what I'd prefer because as you said, one 
would think that with everything in the same query, the score would organize 
everything nicely.

>>Assuming you're using the default relevancy sort
Yes

>> does the order of your search results change dramatically from one version 
>> to the other?  If it does, is the order generally better from a relevance 
>> standpoint, or generally worse?  If you are specifying an explicit sort, 
>> then the scores will likely be ignored.

Here's what we do - we have a list of policies with names (among other things, 
but I'll just use names for an example).

We search for several business names to see if we have policies in common with 
the names so that we don’t have too much risk with them.

So let's say I'm doing a search against three business names

Bob's carpentry
Consolidated carpentry of the Greater North West
Carpentry Land

q=IDX_CompanyName:(bob's AND carpentry) OR IDX_CompanyName:(consolidated AND 
carpentry AND of AND the AND Greater AND North AND West) OR 
IDX_CompanyName:(Carpentry AND Land)
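
For reference, a query of this shape can be assembled from a list of names
with a small helper. This is a minimal sketch under assumptions: the helper
names are hypothetical, the terms are scoped to the field with parentheses
(the grouped form used in the first message of this thread), and characters
that the standard query parser treats as syntax are backslash-escaped. It
does not address the ranking problem described below.

```python
import re

# Characters the Lucene/Solr standard query parser treats as syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_term(term):
    """Backslash-escape query-syntax characters inside a single term."""
    return _SPECIAL.sub(r"\\\1", term)


def build_query(field, names):
    """AND the words of each name together, then OR the names together."""
    clauses = []
    for name in names:
        terms = [escape_term(t) for t in name.split()]
        clauses.append(f"{field}:({' AND '.join(terms)})")
    return " OR ".join(clauses)


names = [
    "Bob's carpentry",
    "Consolidated carpentry of the Greater North West",
    "Carpentry Land",
]
print(build_query("IDX_CompanyName", names))
```

A name that contains a bare word such as AND, OR or NOT would still be read
as an operator, so real input may need more handling than this sketch does.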

Searching for 750 rows returns hits that are all focused on Consolidated 
(seemingly because the number of words pushes the SOLR score into a higher 
range for all Consolidated results, as mentioned in my previous email). 
Searching for all 3 names at the same time doesn't ensure that all 3 companies 
will be in the results, even though each has results when run separately. If I 
boost maxrows to 4000, I see a few bob's carpentry results, but most are still 
Consolidated.

So the way we had addressed it was running 3 separate SOLR queries and 
combining them and sorting them by descending score - wasn’t perfect, but it 
worked, and helped me to reduce the number of results we hand off to a scoring 
engine that applies 3 algorithms (Monge-Elkan, Jaro-Winkler, and SmithWindowed 
Affine) to further hone the results - which can take LOTS of time if there are 
a lot of results, so 


>> What I am describing is also why it's strongly recommended that you never try 
>> to convert scores to percentages:
>>
>> https://wiki.apache.org/lucene-java/ScoresAsPercentages
>>
>> Thanks,
>> Shawn



Re: SOLR Score Range Changed

2018-02-23 Thread Joël Trigalo
The difference seems to be due to the fact that the default similarity in Solr
7 is BM25, while it used to be TF-IDF in Solr 4. As you realised, the BM25
function is smoother.
You can configure schema.xml to use ClassicSimilarity, for instance:
https://lucene.apache.org/solr/guide/6_6/major-changes-from-solr-5-to-solr-6.html#default-similarity-changes
https://lucene.apache.org/solr/guide/6_6/field-type-definitions-and-properties.html#FieldTypeDefinitionsandProperties-FieldTypeSimilarity

But as said before, you may be relying on properties that are not guaranteed,
so it would be better to change the score function or the sorting (rather than
going back to ClassicSimilarity).
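
To make "smoother" concrete, here is a rough sketch comparing only the
term-frequency factor of the two similarities. It is a simplification under
stated assumptions: idf, boosts and query-level factors are ignored, the
document is assumed to have average length, and the BM25 expression is the
textbook form with the usual defaults k1=1.2 and b=0.75. It is one ingredient
of the score, not a reproduction of what Solr computes.

```python
import math


def classic_tf(freq):
    # TF-IDF (ClassicSimilarity) term-frequency factor: sqrt of the frequency.
    return math.sqrt(freq)


def bm25_tf(freq, k1=1.2, b=0.75, doc_len=1.0, avg_doc_len=1.0):
    # Textbook BM25 term-frequency factor; it saturates below (k1 + 1).
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1.0) / (freq + norm)


# Print frequency, classic factor, BM25 factor side by side.
for freq in (1, 2, 4, 16, 64):
    print(freq, round(classic_tf(freq), 3), round(bm25_tf(freq), 3))
```

The classic factor keeps growing with the term frequency, while the BM25
factor flattens out quickly, which is one reason the spread between the best
and worst hit of a query can look narrower under BM25.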

2018-02-22 18:39 GMT+01:00 Shawn Heisey :

> On 2/22/2018 9:50 AM, Hodder, Rick wrote:
>
>> I am migrating from SOLR 4.10.2 to SOLR 7.1.
>>
>> All seems to be going well, except for one thing: the score that is
>> coming back for the resulting documents is giving different scores.
>>
>
> The absolute score has no meaning when you change something -- the index,
> the query, the software version, etc.  You can't compare absolute scores.
>
> What matters is the relative score of one document to another *in the same
> query*.  The amount of difference is almost irrelevant -- the goal of
> Lucene's score calculation gymnastics is to have one document score higher
> than another, so the *order* is reasonably correct.
>
> Assuming you're using the default relevancy sort, does the order of your
> search results change dramatically from one version to the other?  If it
> does, is the order generally better from a relevance standpoint, or
> generally worse?  If you are specifying an explicit sort, then the scores
> will likely be ignored.
>
> What I am describing is also why it's strongly recommended that you never
> try to convert scores to percentages:
>
> https://wiki.apache.org/lucene-java/ScoresAsPercentages
>
> Thanks,
> Shawn
>
>


Re: SOLR Score Range Changed

2018-02-22 Thread Shawn Heisey

On 2/22/2018 9:50 AM, Hodder, Rick wrote:
> I am migrating from SOLR 4.10.2 to SOLR 7.1.
>
> All seems to be going well, except for one thing: the score that is coming back 
> for the resulting documents is giving different scores.


The absolute score has no meaning when you change something -- the 
index, the query, the software version, etc.  You can't compare absolute 
scores.


What matters is the relative score of one document to another *in the 
same query*.  The amount of difference is almost irrelevant -- the goal 
of Lucene's score calculation gymnastics is to have one document score 
higher than another, so the *order* is reasonably correct.


Assuming you're using the default relevancy sort, does the order of your 
search results change dramatically from one version to the other?  If it 
does, is the order generally better from a relevance standpoint, or 
generally worse?  If you are specifying an explicit sort, then the 
scores will likely be ignored.


What I am describing is also why it's strongly recommended that you 
never try to convert scores to percentages:


https://wiki.apache.org/lucene-java/ScoresAsPercentages
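
A small illustration of the percentage problem, using made-up scores (the
numbers and the helper name are hypothetical): the same raw score for the
same document reads as a very different percentage depending on what else
matched the query.

```python
def as_percentages(scores):
    """Naively scale scores so that the top hit reads as 100%."""
    top = max(scores.values())
    return {doc: round(100.0 * s / top, 1) for doc, s in scores.items()}


# Hypothetical scores; document "A" scores 2.0 in both queries.
query_1 = {"A": 2.0, "B": 1.0}
query_2 = {"A": 2.0, "C": 8.0}

pct_1 = as_percentages(query_1)
pct_2 = as_percentages(query_2)
print(pct_1)
print(pct_2)
```

Document "A" comes out as the 100% hit in the first query and as a 25% hit
in the second, although its raw score is identical in both.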

Thanks,
Shawn



SOLR Score Range Changed

2018-02-22 Thread Hodder, Rick
I am migrating from SOLR 4.10.2 to SOLR 7.1.

All seems to be going well, except for one thing: the scores coming back 
for the resulting documents are different.

The core uses a schema. Here's the schema info for the field that I am 
searching on:




When searching maxrows=750, fields: *,score

IDX_Company:(cat and scratch)

SOLR 7.1: max score 6.95 and a min of 6.28

SOLR 4.10.2: max score 8.63 and a min of 0.91

IDX_InsuredName:(cat and scratch and fever)

SOLR 7.1: max score 12.99 and a min of 11.25

SOLR 4.10.2: max score 3.97 and a min of 0.77

See how the ranges of values differ (ranges in 7.1 don't go down to 0.x). 
Also notice that the max score roughly doubles when I add one word to the 
search terms in 7.1. Most important, the ranges in 4.10.2 overlap, but those 
in 7.1 don't.

A little more information to show you how I use this information, and why this 
is causing a problem.

I get a company name like "bobs cabinetry" and another like "all american tech 
enterprise".

I run two SOLR queries per company name, I'll call them 1-AND, 1-OR, 2-AND, 
2-OR.

IDX_Company:(bobs AND cabinetry) =*,score,requestid:"1-AND"
IDX_Company:(bobs OR cabinetry) =*,score,requestid:"1-OR"
IDX_Company:(all AND american AND tech AND enterprise) 
=*,score,requestid:"2-AND"
IDX_Company:(all OR american OR tech OR enterprise) =*,score,requestid:"2-OR"

I combine the results, sort by descending score, and then take the top 
750 rows. (The requestid lets me know which query the results came from.)

Because of the changes in the range of scores, the sort pushes all of the all 
american tech enterprise rows to the top of the results (because of no 
overlap), and when the top 750 are taken everything for bobs cabinetry is 
removed from the results.
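
The effect can be reproduced in a few lines. The scores below are made up
for illustration, loosely modelled on the non-overlapping 7.1 ranges reported
above; the row layout and function name are hypothetical, and the limit is 3
instead of 750 to keep the example short.

```python
def top_by_score(rows, limit):
    """Sort combined rows by descending score and keep the first `limit`."""
    return sorted(rows, key=lambda r: r["score"], reverse=True)[:limit]


# Hypothetical rows: the two queries' score ranges do not overlap.
rows = (
    [{"requestid": "1-AND", "score": s} for s in (6.9, 6.6, 6.3)]
    + [{"requestid": "2-AND", "score": s} for s in (12.9, 12.1, 11.3)]
)

kept = top_by_score(rows, limit=3)
surviving = {r["requestid"] for r in kept}
print(surviving)
```

Only rows from request 2-AND survive the cut, even though request 1-AND
returned just as many matches.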

Is there some config setting I can change to make score calculation act like it 
did in 4.10.2?

Or something else?