dcausse added a subscriber: dcausse.
dcausse added a comment.

//First of all: sorry for all the low-level details in this comment, but it's always complex to tackle such relevance issues.//
I assume that `life` is the query. Wikidata already uses `incoming_link` to boost the top-N results (8196 docs per shard).

The way cirrus scores documents for wikidata is:
1. The lucene score (applied to all docs). NOTE: when I talk about top-N docs below, this is according to this ranking.
2. The phrase rescore: if the query has more than one word, the doc is overboosted if it contains the same sequence of adjacent words. Only the top-N docs are analyzed (N=512 per shard here, because it's very costly). This does not apply here because the query is one word.
3. The namespace boost: Special:Search on wikidata is configured to query 2 namespaces (0 and 120). The boost for ns 0 is 0.05 and for ns 120 is 0.2 (top-8196 docs per shard analyzed). I assume this is not related to our problem because there are only 10 properties <https://www.wikidata.org/w/index.php?title=Special%3ASearch&profile=advanced&search=life&fulltext=Search&ns120=1&profile=advanced> related to //life//.
4. The number of incoming links (top-8196 docs per shard analyzed).

A small note on the lucene score: Lucene scores docs using a tf.idf formula, and this formula also includes a normalization based on document size, so large documents tend to be ranked lower. This is understandable: large docs may have higher term frequencies and thus higher raw tf.idf scores, and normalizing on size helps to mitigate this problem. Why does it affect wikidata? Because we flatten all the data into the same field, a wikibase entity with a lot of labels in many different languages (likely to happen for high-profile items) will be larger than //less important items// and thus have a lower lucene score.
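To make the length-norm effect concrete, here is a minimal sketch of a classic Lucene-style tf.idf score for a single-term query. It assumes the textbook formulas (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), length norm = 1 / sqrt(field length)) and ignores the norm encoding precision and any query-time factors, so it will not reproduce the real scores. The function name and every field length and corpus statistic below are made up for illustration; only the term frequencies (10 and 64) come from this task.

```python
import math


def classic_tfidf(term_freq, field_length, num_docs, doc_freq):
    """Simplified single-term tf.idf score with length normalization.

    Hypothetical helper: this is a sketch of the classic formula, not the
    exact similarity configured on the wikidata index.
    """
    tf = math.sqrt(term_freq)                           # damped term frequency
    idf = 1.0 + math.log(num_docs / (doc_freq + 1.0))   # rarer terms weigh more
    length_norm = 1.0 / math.sqrt(field_length)         # longer fields are penalized
    return tf * idf * length_norm


# Assumed corpus statistics (illustrative only).
NUM_DOCS = 1_000_000
DOC_FREQ = 20_000

# A large entity with few occurrences vs. a smaller one with many.
# Field lengths (in terms) are assumptions, not measured values.
large_item = classic_tfidf(term_freq=10, field_length=20_000,
                           num_docs=NUM_DOCS, doc_freq=DOC_FREQ)
small_item = classic_tfidf(term_freq=64, field_length=5_000,
                           num_docs=NUM_DOCS, doc_freq=DOC_FREQ)

print(f"large item: {large_item:.4f}")
print(f"small item: {small_item:.4f}")
```

With everything flattened into one field, adding labels in more languages only increases `field_length`, so the score of a high-profile item drops even though nothing about its relevance changed.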
Because of the current cirrus<->wikidata mapping problems we're trying to address (everything is in the same field, so no boosts on title/redirects can be applied), it's very likely that the incoming_link boost will take precedence over the lucene score. From what I see, life has a low number of incoming_link <https://www.wikidata.org/wiki/Q3?action=cirrusDump> (53) compared to Encyclopedia of Life <https://www.wikidata.org/wiki/Q82486?action=cirrusDump>, which has //1 081 079// incoming links. On the other hand, the third result has only 32 incoming_links <https://www.wikidata.org/wiki/Q752241?action=cirrusDump>.

Why does Q3 have a bad lucene score? Let's compare Q3 (ranked ~700) and Q752241 (ranked 4):
- Q3 lucene score is 0.5476983
- Q752241 lucene score is 0.85728467

This is because there are only 10 occurrences of the word life in the content for Q3 and 64 for Q752241, and Q3 is larger (length norm effect).

The boost on incoming links is:
- Q3: should be something like log(2+53) but it's 0.69897 <- **completely wrong**, it looks like it's log(2+3)
- Q752241: should be something like log(2+32) and it's 1.5314789, which is good.

So it looks like the problem is that the number of incoming links stored in elasticsearch does not reflect the actual number. This is normal in certain conditions: we have an optimization to not update docs too frequently, so if the number of incoming links does not change by more than 20% we ignore the update. But here it's way more than 20%, it's a 1700% difference... I'm not sure what happened here...

Would it be possible to update Q3 to force a re-index of this entity and see if it fixes the issue? If yes, then we will certainly have to write a maintenance script to check this incoming_link consistency.

Side note: as you can see, the lucene score is rather bad for Q3, so scoring is very fragile on wikidata. This cannot be addressed without all the work planned to add a better cirrus<->wikidata integration.
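The boost arithmetic above can be checked with a few lines. This is only a sketch: the function names are hypothetical, the boost is assumed to be a base-10 log of (2 + incoming links) because that is what reproduces the two values quoted above, and the 20% rule is assumed to be measured relative to the stored count.

```python
import math


def incoming_links_boost(incoming_links):
    """Assumed boost formula: log10(2 + number of incoming links)."""
    return math.log10(2 + incoming_links)


def should_skip_update(stored, actual, threshold=0.2):
    """Hypothetical version of the 'ignore small changes' optimization.

    Returns True when the relative change against the stored count is
    within the threshold, i.e. the document would not be re-indexed.
    """
    if stored == 0:
        return actual == 0
    return abs(actual - stored) / stored <= threshold


# Stored boost for Q3 matches 3 incoming links, not the 53 seen in the dump.
print(incoming_links_boost(3))    # matches the 0.69897 quoted above
print(incoming_links_boost(32))   # matches the 1.5314789 quoted above
print(incoming_links_boost(53))   # what Q3 should roughly get instead

# A jump from 3 to 53 links is far beyond 20%, so it should not be skipped.
print(should_skip_update(stored=3, actual=53))
```

Under these assumptions the stored value for Q3 is explained by a count of 3, and the update optimization alone cannot account for it, which is why a forced re-index is the useful experiment.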
TASK DETAIL https://phabricator.wikimedia.org/T110648