EBernhardson claimed this task. EBernhardson moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board. EBernhardson added a comment.
The UI for adding statements is using wbsearchentities <https://www.wikidata.org/wiki/Special:ApiSandbox#action=wbsearchentities&format=json&uselang=en&errorformat=plaintext&search=asse&language=en&type=lexeme&formatversion=2> (explain <https://www.wikidata.org/w/api.php?action=wbsearchentities&search=asse&format=json&errorformat=plaintext&language=en&uselang=en&type=lexeme&cirrusDumpResult&cirrusExplain=pretty>). Target results are L1191921 and L1144955.

The scoring method for wbsearchentities could be summarized as bucketing results into 3 groups based on how well they match, and then sorting by popularity (statement count and incoming link count) within those buckets. Of all the docs that make the best possible match (near_match on lemma or near_match on lexeme_forms.representation), the two target documents have the lowest popularity, with zero incoming links and a single statement each. Reviewing a few of the documents that were not targeted but ranked higher, they also match lexeme_forms.representation. In a more traditional search context using term frequencies, the fact that the target lexemes have a single statement each would push them up in the ranking, but because wbsearchentities buckets the results instead of giving them individual scores, that doesn't happen here.

One thing we could do is be less strict on the bucketing. In a quick test, setting a dismax tie breaker of 0.02 gives these target documents a boost up to the top of the ranking. This is not directly configurable; it was set in the initial commit for WikibaseLexemeCirrusSearch and never changed. This does read from our profile service at least, so it shouldn't be too hard to add a custom profile parameter to control the dismax tie breaker and set it to something that works a bit better. What value is appropriate is hard to say: at 0.01 these docs get a boost up into the top 7, but not all the way to the top.
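As a rough illustration of why a non-zero tie breaker changes the order, here is a toy model of dis_max scoring. The only part taken from Elasticsearch is the dis_max formula itself (best matching clause plus tie_breaker times the sum of the other matching clauses). The document names, the per-field scores of 1.0, the popularity values, and the additive popularity term are all hypothetical stand-ins chosen for the example, not the real WikibaseLexemeCirrusSearch profile or real index data.

```python
def dis_max(clause_scores, tie_breaker=0.0):
    """dis_max: best matching clause + tie_breaker * sum of the other matching clauses."""
    matching = [s for s in clause_scores if s > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


# Hypothetical documents: per-field match scores plus a small popularity bonus.
# "target" matches both fields but is unpopular; "other" matches one field only.
DOCS = {
    "target": {"lemma": 1.0, "lexeme_forms.representation": 1.0, "popularity": 0.001},
    "other": {"lemma": 0.0, "lexeme_forms.representation": 1.0, "popularity": 0.010},
}


def rank(docs, tie_breaker):
    """Return doc names ordered by (assumed) dis_max score + popularity, best first."""
    def score(doc):
        fields = [doc["lemma"], doc["lexeme_forms.representation"]]
        # Adding popularity on top is a simplification made for this sketch.
        return dis_max(fields, tie_breaker) + doc["popularity"]

    return sorted(docs, key=lambda name: score(docs[name]), reverse=True)


if __name__ == "__main__":
    for tb in (0.0, 0.02):
        print(tb, rank(DOCS, tb))
```

With a tie breaker of 0 both documents land in the same bucket and popularity alone decides the order; with a small tie breaker the second matching field adds enough to outweigh the popularity gap in this toy data.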
Essentially what ends up pushing these docs to the top of the ranking with the tie breaker is that they match both the lemma and lexeme_forms.representation fields, where the other docs only match one of the two fields.

TASK DETAIL
https://phabricator.wikimedia.org/T348877

WORKBOARD
https://phabricator.wikimedia.org/project/board/1227/

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: EBernhardson
Cc: EBernhardson, Gehel, Nikki, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, Mahir256, QZanden, EBjune, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
