dcausse added a comment.
Sorry... I was completely wrong when analyzing the lucene explain for Q3 (it's a
pain to debug scoring issues:
<https://www.wikidata.org/w/index.php?title=Special:Search&limit=10&offset=850&profile=default&search=life&cirrusDumpResult&cirrusExplain>).
I think I was reading the explain output of another entity.
Q3 lucene score is 0.1824194
Boost link score is: 1.763428 ~= log(2+53) so it's OK
Namespace boost: 0.05
Final score will be : 0.1824194 * 1.763428 * 0.05 => 0.016084173
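The arithmetic above can be reproduced with a small sketch. The function names are hypothetical (they are not CirrusSearch code), and the base-10 logarithm is an assumption inferred from the figures: log10(2 + 53) is about 1.740, close to but not exactly the reported 1.763428.

```python
import math


def link_boost_from_links(incoming_links: int) -> float:
    """Hypothetical helper: boost derived from the incoming link count.

    Assumes a base-10 logarithm of (2 + links), as the figures suggest.
    """
    return math.log10(2 + incoming_links)


def final_score(lucene_score: float, link_boost: float, ns_boost: float) -> float:
    """Final score as the product of the three factors quoted above."""
    return lucene_score * link_boost * ns_boost


# Values reported for Q3 in the explain output.
q3 = final_score(0.1824194, 1.763428, 0.05)
print(f"Q3 final score: {q3:.8f}")
print(f"log10(2 + 53):  {link_boost_from_links(53):.4f}")
```

Because the three factors are simply multiplied, a very low lucene score drags the final score down no matter how good the link boost is.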
Here are a few examples:
| entity | Number of words | Life freq | Lucene | Links | ns | final | rank | desc |
| Q3 | 830 | 9 | 0.1824194 | 1.763428 | 0.05 | 0.01608417 | ~800 | The lucene score is very bad |
| Q752241 | 280 | 64 | 0.8565265 | 1.5314789 | 0.05 | 0.06558761 | 4 | The lucene score is good and incoming_link is OK |
| Q171972 | 89 | 34 | 1.075165 | 0.7781513 | 0.05 | 0.041832052 | 20 | Incoming link is bad but the lucene score is good even with only 34 occurrences; this is because of the size norm (89 words vs 280 for Q752241) |
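To illustrate the size norm effect, here is a rough sketch that assumes Lucene's classic TF-IDF similarity, where tf = sqrt(freq) and the length norm is 1/sqrt(number of terms). Which similarity was actually configured is not stated here, and idf, query norm and the lossy storage of the norm are ignored, so only the relative order is meaningful, not the absolute values.

```python
import math


def tf_times_length_norm(term_freq: int, num_words: int) -> float:
    """Classic Lucene-style term factor: sqrt(freq) * 1/sqrt(doc length).

    Assumption: classic TF-IDF similarity; idf and query norm are left out.
    """
    return math.sqrt(term_freq) / math.sqrt(num_words)


# (occurrences of "life", number of words), taken from the table above.
entities = {
    "Q3": (9, 830),
    "Q752241": (64, 280),
    "Q171972": (34, 89),
}

for name, (freq, words) in entities.items():
    print(name, round(tf_times_length_norm(freq, words), 3))
```

Under these assumptions Q171972 gets the highest factor despite having fewer occurrences than Q752241, and Q3 gets by far the lowest, which is the same order as the lucene scores in the table.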
So clearly the low rank of Q3 is caused by its bad lucene score.
So I was wrong: incoming links won't take precedence.
But I can't explain why this has changed in August... :(
To sum up:
- fixing the bad lucene score will require a better cirrus <> wikidata
integration to allow more complex queries with dedicated fields and boosts.
- A workaround could be to write a custom rescore profile with a new numeric or
by overboosting incoming links (maybe completely inhibiting the lucene score for now).
Could be addressed by https://gerrit.wikimedia.org/r/#/c/249460/
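A rough sketch of what inhibiting the lucene score would do to the three entities from the table. The `lucene_weight` exponent is a hypothetical knob used only for illustration; it is not an existing CirrusSearch or Elasticsearch setting, and the result only orders these three entities, not the full result set.

```python
# (lucene score, link boost, namespace boost), taken from the table above.
rows = {
    "Q3": (0.1824194, 1.763428, 0.05),
    "Q752241": (0.8565265, 1.5314789, 0.05),
    "Q171972": (1.075165, 0.7781513, 0.05),
}


def rescored(lucene: float, link_boost: float, ns_boost: float,
             lucene_weight: float = 1.0) -> float:
    """Hypothetical rescore: lucene_weight=1 keeps the current product,
    lucene_weight=0 removes the lucene score from the product entirely."""
    return (lucene ** lucene_weight) * link_boost * ns_boost


def ranking(lucene_weight: float) -> list[str]:
    """Entities ordered by descending rescored value."""
    return sorted(
        rows,
        key=lambda name: rescored(*rows[name], lucene_weight=lucene_weight),
        reverse=True,
    )


print(ranking(1.0))  # current behaviour: Q752241 first, Q3 last
print(ranking(0.0))  # lucene score inhibited: Q3 first
```

With the lucene score inhibited the order follows incoming links alone, which would put Q3 ahead of the other two entities.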
TASK DETAIL
https://phabricator.wikimedia.org/T110648
To: dcausse
Cc: aude, dcausse, Deskana, daniel, Mbch331, Aklapper, Lydia_Pintscher,
Wikidata-bugs, Gryllida, jeremyb