daniel added a subscriber: daniel.
daniel added a comment.
Using the sitelink count for scoring was intended to be a workaround. Cirrus
already has the number of incoming links ("in-degree") for each item, which it
uses for scoring by default. Why is that not good enough for our case?

The main problem with the current scoring seems to be that Cirrus uses tf/idf
scoring. The "tf" bit ("term frequency", the number of times the search term
occurs in the document) should not be used for Wikidata items, because it is
not a good indicator of relevance there. The "idf" bit ("inverse document
frequency") is intended to reduce the impact of irrelevant (too common) terms
in the search string - which is useless for single-word (or prefix) searches.

If we want to improve scoring, we should make sure that in-degree is used, and
tf/idf is not used.
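
To illustrate the difference, here is a rough sketch in plain Python. It is not
how Cirrus actually computes scores; the item IDs, labels and link counts are
made up, and term_frequency, rank_by_tf and rank_by_in_degree are hypothetical
helper names used only for this example. It only shows how ranking a
single-word search by term frequency can put an obscure item first, while
ranking by in-degree does not.

```python
# Hypothetical toy items: (item id, label text, number of incoming links).
# All values are invented for illustration.
ITEMS = [
    ("Q1", "Berlin", 5000),
    ("Q2", "Berlin Berlin Berlin (song)", 3),
    ("Q3", "Bern", 1200),
]


def term_frequency(term, text):
    """Count how often the term occurs as a word in the text (the "tf" bit)."""
    words = (word.strip("()") for word in text.lower().split())
    return sum(1 for word in words if word == term.lower())


def rank_by_tf(term, items):
    """Rank matching items by term frequency only."""
    matches = [item for item in items if term_frequency(term, item[1]) > 0]
    return sorted(matches, key=lambda item: term_frequency(term, item[1]),
                  reverse=True)


def rank_by_in_degree(term, items):
    """Rank matching items by incoming link count, ignoring term frequency."""
    matches = [item for item in items if term_frequency(term, item[1]) > 0]
    return sorted(matches, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    print([item[0] for item in rank_by_tf("berlin", ITEMS)])
    print([item[0] for item in rank_by_in_degree("berlin", ITEMS)])
```

With a single search term, the idf factor would be the same for every matching
item, so it is left out here: it cannot change the order of the results.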
TASK DETAIL
https://phabricator.wikimedia.org/T119066
To: aude, daniel
Cc: daniel, aude, Aklapper, Wikidata-bugs, Mbch331