daniel added a subscriber: daniel.
daniel added a comment.

Using the sitelink count for scoring was intended to be a workaround. Cirrus 
already has the number of incoming links ("in-degree") for each item, which it 
uses for scoring by default. Why is that not good enough for our case?

The main problem with the current scoring seems to be that Cirrus uses tf/idf 
scoring. The "tf" part ("term frequency", the number of times the search term 
occurs in the document) should not be used for Wikidata items; it is not a 
good indicator of relevance. The "idf" part ("inverse document frequency") is 
intended to reduce the impact of irrelevant (too common) terms in the search 
string, which is useless for single-word (or prefix) searches.
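
To illustrate the point about single-term searches, here is a minimal sketch 
in Python. It is not how Cirrus or Lucene actually compute scores: the toy 
corpus, the item ids, the function names, and the smoothed idf formula are all 
assumptions made for the example. It only shows that with one search term, idf 
is the same factor for every matching document, so the ranking is decided by 
tf alone.

```python
import math

# Hypothetical toy corpus: item id -> label/alias text (made-up data).
DOCS = {
    "Q1": "berlin",
    "Q2": "berlin berlin berlin wall",
    "Q3": "paris",
}


def tf(term, text):
    """Term frequency: how often the term occurs in the text."""
    return text.split().count(term)


def idf(term, docs):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    n_match = sum(1 for text in docs.values() if term in text.split())
    return math.log((1 + len(docs)) / (1 + n_match)) + 1


def tfidf_scores(term, docs):
    """Score every document containing the term with tf * idf."""
    weight = idf(term, docs)  # identical for all documents: one term, one idf
    return {
        doc_id: tf(term, text) * weight
        for doc_id, text in docs.items()
        if tf(term, text) > 0
    }


scores = tfidf_scores("berlin", DOCS)
ranking_tfidf = sorted(scores, key=scores.get, reverse=True)
ranking_tf = sorted(scores, key=lambda d: tf("berlin", DOCS[d]), reverse=True)
```

In this toy data the item that merely repeats the term ranks first, and 
dropping idf does not change the order.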

If we want to improve scoring, we should make sure that in-degree is used, and 
tf/idf is not used.
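
As a rough sketch of what "in-degree only" ranking could look like for a 
prefix search (Python, with made-up items and link counts; the field names, 
the function name, and the log dampening are assumptions, not the Cirrus 
configuration):

```python
import math

# Hypothetical candidates with invented in-degree values.
CANDIDATES = [
    {"id": "Q_c", "label": "Berliner", "in_degree": 30},
    {"id": "Q_a", "label": "Berlin", "in_degree": 5000},
    {"id": "Q_d", "label": "Paris", "in_degree": 9000},
    {"id": "Q_b", "label": "Berlin Wall", "in_degree": 800},
]


def prefix_search(prefix, items):
    """Return items whose label starts with the prefix, ranked by in-degree.

    Text matching only filters the candidates; it contributes nothing to
    the score, so term frequency plays no role in the ordering.
    """
    needle = prefix.casefold()
    matches = [it for it in items if it["label"].casefold().startswith(needle)]
    # log1p dampens very large link counts; it does not change the order.
    return sorted(matches, key=lambda it: math.log1p(it["in_degree"]), reverse=True)


result_ids = [it["id"] for it in prefix_search("berl", CANDIDATES)]
```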


TASK DETAIL
  https://phabricator.wikimedia.org/T119066

To: aude, daniel
Cc: daniel, aude, Aklapper, Wikidata-bugs, Mbch331



_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs