[ 
https://issues.apache.org/jira/browse/SOLR-6243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060983#comment-14060983
 ] 

Brian commented on SOLR-6243:
-----------------------------

Looking at the 3.5 vs. 4.8.1 src:

Originally the code was like this:

        // full phrase...
        addShingledPhraseQueries(query, normalClauses, phraseFields, 0, 
                                 tiebreaker, pslop);
        // shingles...
        addShingledPhraseQueries(query, normalClauses, phraseFields2, 2,  
                                 tiebreaker, pslop);
        addShingledPhraseQueries(query, normalClauses, phraseFields3, 3,
                                 tiebreaker, pslop);

I.e., the whole set of phrase fields was passed to addShingledPhraseQueries.

In 4.8.1, it was changed so that all individual phrase fields are returned and 
added (ExtendedDismaxConfiguration builds this list by calling "addAll" for 
each pf list, so all the fields are included in it).  Now it calls 
"addShingledPhraseQueries" individually for each field, instead of once for 
the set of fields - which is why the query is built incorrectly:

        // gets ALL INDIVIDUAL fields...
        List<FieldParams> allPhraseFields = config.getAllPhraseFields();
        ...
        // now each individual field is added - this is incorrect
        for (FieldParams phraseField : allPhraseFields) {
          Map<String,Float> pf = new HashMap<>(1);
          pf.put(phraseField.getField(), phraseField.getBoost());
          addShingledPhraseQueries(query, normalClauses, pf,
                                   phraseField.getWordGrams(), config.tiebreaker,
                                   phraseField.getSlop());
        }
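
To illustrate why the per-field calls matter for scoring, here is a small 
standalone sketch (plain arithmetic, not Solr or Lucene code).  It compares 
the disjunction max combination with plain addition, using the two per-field 
values from the 3.5 explain output in the issue description and tie=0.17.  
The class and method names are made up for the example, and it ignores any 
other factors the enclosing boolean query may apply.

```java
// Illustrative arithmetic only; this is not Solr or Lucene code.
final class DisMaxScoreSketch {

    // Disjunction max: the best score plus tie times the sum of the others.
    // Assumes non-negative scores.
    static double disMax(double tie, double... scores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : scores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tie * (sum - max);
    }

    // Plain addition, as when each field ends up as its own optional clause.
    static double plainSum(double... scores) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return sum;
    }

    public static void main(String[] args) {
        double tie = 0.17;
        // Per-field values taken from the 3.5 explain output in the issue.
        double[] scores = {1.5596248, 0.054681662};
        System.out.println("dismax: " + disMax(tie, scores));
        System.out.println("sum:    " + plainSum(scores));
    }
}
```

With these inputs the disjunction max value comes to about 1.56892, which 
agrees with the 1.5689207 reported by 3.5, while plain addition gives about 
1.61431.  The gap grows as more phrase fields match.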

Probably the mistake is that config.getAllPhraseFields() should return a list 
of lists (the fields for each of pf, pf2, and pf3) instead of a flat list of 
all the fields included in those.  Perhaps when the original code was 
refactored and split into separate classes this functionality was 
misinterpreted.  I'm surprised this has gone unnoticed for so long - did no 
one else notice incorrect queries?  Also, this code is very ugly and difficult 
to read - what happened to clean code and code reviews?
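
A minimal sketch of the "list of lists" idea, under stated assumptions: the 
FieldParams class below is a hypothetical stand-in that only mirrors the four 
getters used in the loop above, and grouping by (wordGrams, slop) is my guess 
at a suitable key, not something taken from the Solr source.  Each resulting 
map could then be passed to a single addShingledPhraseQueries call, so that 
all fields of one pf/pf2/pf3 list are combined together again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PhraseFieldGroupingSketch {

    // Hypothetical stand-in for Solr's FieldParams (assumption, not the real class).
    static final class FieldParams {
        private final String field;
        private final float boost;
        private final int wordGrams;
        private final int slop;

        FieldParams(String field, float boost, int wordGrams, int slop) {
            this.field = field;
            this.boost = boost;
            this.wordGrams = wordGrams;
            this.slop = slop;
        }

        String getField() { return field; }
        float getBoost() { return boost; }
        int getWordGrams() { return wordGrams; }
        int getSlop() { return slop; }
    }

    // Groups fields that share the same (wordGrams, slop) pair, keeping input order.
    // Each value is a field -> boost map covering one whole group of fields.
    static Map<List<Integer>, Map<String, Float>> groupByGramsAndSlop(
            List<FieldParams> allPhraseFields) {
        Map<List<Integer>, Map<String, Float>> groups = new LinkedHashMap<>();
        for (FieldParams p : allPhraseFields) {
            List<Integer> key = Arrays.asList(p.getWordGrams(), p.getSlop());
            groups.computeIfAbsent(key, k -> new LinkedHashMap<>())
                  .put(p.getField(), p.getBoost());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<FieldParams> all = new ArrayList<>();
        all.add(new FieldParams("body", 0.5f, 0, 100));   // from pf
        all.add(new FieldParams("title", 1.2f, 0, 100));  // from pf
        all.add(new FieldParams("title", 1.2f, 2, 100));  // from pf2
        // One entry per group; the real code would make one
        // addShingledPhraseQueries call per entry instead of one per field.
        groupByGramsAndSlop(all).forEach((key, fields) ->
            System.out.println("wordGrams/slop " + key + " -> " + fields));
    }
}
```

The grouping keeps the loop structure of the 4.8.1 code but restores the 3.5 
behaviour of handing a whole set of fields to each call.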

> eDisMax hidden change - no longer applies disjunction max to "pf" query
> -----------------------------------------------------------------------
>
>                 Key: SOLR-6243
>                 URL: https://issues.apache.org/jira/browse/SOLR-6243
>             Project: Solr
>          Issue Type: Bug
>          Components: query parsers
>    Affects Versions: 4.8.1
>            Reporter: Brian
>              Labels: edismax, extendedDisMax, pf, phrase
>
> At some point after Solr 3.5 a bug was introduced into eDisMax (Extended 
> DisMax Query parser) that is still there as of Solr 4.8.1.  The "pf" part of 
> the query (full phrase query) no longer is applied as a disjunction max query 
> - instead all the matching field scores are simply added to the total score.  
> I.e. they are just added together as opposed to the max being taken + 
> tie-breaker times the sum of the other match scores.
> This changes the scores and the rankings significantly.  When upgrading from 
> Solr 3.5, one of our relevance test measures showed target results dropping 
> over a full rank due to this bug.  One key result went from being at rank 7 to 
> past rank 40.  I do not see any easy workaround for this.
> The following is a comparison between query results for Solr 3.5 and Solr 
> 4.8, for one query, showing the "pf" parts of the query and scores.
> With debug query turned on, the results are the following.  They clearly show 
> that max is used with the tiebreaker in 3.5 but not in 4.8 for pf: 
> query (3.5): 
> boost(+(((inlink_text:edg^1.2 | body:edg^0.5 | title:edg^1.2 | 
> meta_description:edg^0.5 | url_path:edg^1.2 | file_name:edg^1.2 | 
> primary_header:edg^1.2 | secondary_header:edg^0.5)~0.17 
> (inlink_text:detect^1.2 | body:detect^0.5 | title:detect^1.2 | 
> meta_description:detect^0.5 | url_path:detect^1.2 | file_name:detect^1.2 | 
> primary_header:detect^1.2 | secondary_header:detect^0.5)~0.17)~2) 
> (inlink_text:"edg detect"~100^1.2 | body:"edg detect"~100^0.5 | title:"edg 
> detect"~100^1.2 | meta_description:"edg detect"~100^0.5 | url_path:"edg 
> detect"~100^1.2 | file_name:"edg detect"~100^1.2 | primary_header:"edg 
> detect"~100^1.2 | secondary_header:"edg 
> detect"~100^0.5)~0.17,product(float(hier_score),pow(float(link_score),const(0.25))))
>  
> I.e., the "pf" part of the query has the following disjunction max form:
> (inlink_text:"edg detect"~100^1.2 | body:"edg detect"~100^0.5 | ... | 
> secondary_header:"edg detect"~100^0.5)~0.17
> pf results for one (3.5): 
> <lst>
> <bool name="match">true</bool>
> <float name="value">1.5689207</float>
> <str name="description">max plus 0.17 times others of:</str>
> <arr name="details">
> <lst>
> <bool name="match">true</bool>
> <float name="value">1.5596248</float>
> <str name="description">...</str>
> <arr name="details">...</arr>
> </lst>
> <lst>
> <bool name="match">true</bool>
> <float name="value">0.054681662</float>
> <str name="description">...</str>
> <arr name="details">...</arr>
> </lst>
> </arr>
> However, in 4.8, "max" and the tie-breaker are nowhere to be seen for the pf 
> part of the query: 
> query (4.8): 
> boost(+(((inlink_text:edg^1.2 | body:edg^0.5 | title:edg^1.2 | 
> meta_description:edg^0.5 | url_path:edg^1.2 | file_name:edg^1.2 | 
> primary_header:edg^1.2 | secondary_header:edg^0.5)~0.17 
> (inlink_text:detect^1.2 | body:detect^0.5 | title:detect^1.2 | 
> meta_description:detect^0.5 | url_path:detect^1.2 | file_name:detect^1.2 | 
> primary_header:detect^1.2 | secondary_header:detect^0.5)~0.17)~2) body:"edg 
> detect"~100^0.5 title:"edg detect"~100^1.2 url_path:"edg detect"~100^1.2 
> file_name:"edg detect"~100^1.2 primary_header:"edg detect"~100^1.2 
> secondary_header:"edg detect"~100^0.5 meta_description:"edg detect"~100^0.5 
> inlink_text:"edg 
> detect"~100^1.2,product(float(hier_score),pow(float(link_score),const(0.25))))
>  
> I.e., the "pf" part of the query does NOT have the disjunction max form:
> body:"edg detect"~100^0.5 title:"edg detect"~100^1.2 ... inlink_text:"edg 
> detect"~100^1.2,
> pf results for one (4.8) (no max; both values are just listed under the "sum 
> of" element): 
> <lst>
> <bool name="match">true</bool>
> <float name="value">0.03554287</float>
> <str name="description">...</str>
> <arr name="details">...</arr>
> </lst>
> <lst>
> <bool name="match">true</bool>
> <float name="value">1.0933692</float>
> <str name="description">...</str>
> <arr name="details">...</arr>
> </lst>
> The Solr 4 handler used is the following - it's also the same as the 3.5 one: 
>  <requestHandler class="solr.SearchHandler" name="/sitewide">
>     
>      <lst name="defaults">
>        <str name="defType">edismax</str>
>        <str name="echoParams">explicit</str>
>         <float name="tie">0.17</float>
>          <str name="qf">
>            body^0.5 title^1.2 url_path^1.2 file_name^1.2 primary_header^1.2 
> secondary_header^0.5 meta_description^0.5 inlink_text^1.2 
>          </str>
>          <str name="pf">
>            body^0.5 title^1.2 url_path^1.2 file_name^1.2 primary_header^1.2 
> secondary_header^0.5 meta_description^0.5 inlink_text^1.2 
>          </str>
>          <int name="ps">100</int>
>      <str name="boost">
>        hier_score 
>      </str>
>      <str name="boost">
>        pow(link_score,0.25) 
>      </str>
>      </lst>
>      <lst name="spellchecker">
>       
>       <str name="spellcheck.onlyMorePopular">false</str>
>       
>       <str name="spellcheck.extendedResults">true</str>
>       
>       <str name="spellcheck.count">3</str>
>       <str name="buildOnCommit">true</str>
>      </lst>
>      <arr name="last-components">
>        <str>spellcheck</str>
>      </arr>
>   </requestHandler>



