[Wikidata-bugs] [Maniphest] T348877: Lexeme searches prefer forms over lemmas

EBernhardson Mon, 13 Nov 2023 14:38:23 -0800

EBernhardson claimed this task.
EBernhardson moved this task from Ready for Dev -- SWE to In Progress on the 
Discovery-Search (Current work) board.
EBernhardson added a comment.



  The UI for adding statements is using wbsearchentities 
<https://www.wikidata.org/wiki/Special:ApiSandbox#action=wbsearchentities&format=json&uselang=en&errorformat=plaintext&search=asse&language=en&type=lexeme&formatversion=2>
 (explain 
<https://www.wikidata.org/w/api.php?action=wbsearchentities&search=asse&format=json&errorformat=plaintext&language=en&uselang=en&type=lexeme&cirrusDumpResult&cirrusExplain=pretty>).
 Target results are L1191921 and L1144955.
  
  The method of scoring for wbsearchentities could be summarized as bucketing 
results into 3 groups based on how well they match, and then sorting by 
popularity (statement count and incoming link counts) within those buckets. Of 
all the docs that make the best possible match (near_match on lemma or 
near_match on lexeme_forms.representation) the two target documents have the 
lowest popularity with zero incoming links and a single statement each. 
Reviewing a few of the documents that were not targeted but ranked higher, they 
also match lexeme_forms.representation.  In a more traditional search context 
using term frequencies the fact that the target lexemes have a single statement 
each would push them up in the ranking, but because wbsearchentities buckets 
the results instead of giving them individual scores that doesn't happen here.
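
  As a rough illustration of the bucket-then-popularity ordering described 
above, here is a minimal sketch. The Doc class, the document ids, the counts, 
and the popularity formula (a plain sum) are all made up for the example; the 
real scoring lives in the CirrusSearch rescore profiles and is not reproduced 
here.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    bucket: int           # 0 = best match bucket, larger = weaker match
    statement_count: int
    incoming_links: int


def popularity(doc: Doc) -> int:
    # Assumption: popularity is a plain sum of the two counts.
    # The production formula is different; this only preserves the ordering idea.
    return doc.statement_count + doc.incoming_links


def rank(docs: list[Doc]) -> list[Doc]:
    # Order by bucket first, then by descending popularity inside each bucket.
    return sorted(docs, key=lambda d: (d.bucket, -popularity(d)))


# Hypothetical documents, not real search results.
docs = [
    Doc("target", bucket=0, statement_count=1, incoming_links=0),
    Doc("popular-form-match", bucket=0, statement_count=12, incoming_links=30),
    Doc("weaker-match", bucket=1, statement_count=50, incoming_links=200),
]
ranked_ids = [d.doc_id for d in rank(docs)]
```

  In this toy data the target sits in the best bucket but ranks last within it, 
because nothing other than popularity distinguishes documents in the same 
bucket.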
  
  One thing we could do is be less strict on the bucketing.  In a quick test 
setting a dismax tie breaker of 0.02 gives these target documents a boost up to 
the top of the ranking. This is not directly configurable; it was set in the 
initial commit for WikibaseLexemeCirrusSearch and never changed.  This does 
read from our profile service at least, so it shouldn't be too hard to add a 
custom profile parameter to control the dismax tie breaker and set this to 
something that works a bit better.  What value is appropriate is hard to say: 
at 0.01 these docs get a boost up into the top 7, but not all the way to the 
top.  Essentially what ends up pushing these docs to the top of the ranking 
with the tie breaker is that they match both the lemma and 
lexeme_forms.representation field, where the other docs only match one of the 
two fields.
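
  To make the tie breaker effect concrete, here is a small sketch of how a 
dismax query combines per-field scores: the best field score, plus the tie 
breaker times the sum of the remaining field scores. The function name and the 
per-field scores are hypothetical; only the 0.02 and 0.01 tie breaker values 
come from the test above.

```python
def dis_max_score(field_scores: list[float], tie_breaker: float) -> float:
    # dismax: take the best matching clause, then add a fraction
    # (tie_breaker) of the scores of all other matching clauses.
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)


# Hypothetical per-field scores, chosen only to show the mechanism.
both_fields = [1.0, 1.0]   # matches lemma and lexeme_forms.representation
one_field = [1.0]          # matches only one of the two fields

strict = dis_max_score(both_fields, 0.0)      # same as one_field: no boost
relaxed = dis_max_score(both_fields, 0.02)    # slightly above one_field
single = dis_max_score(one_field, 0.02)       # unaffected by the tie breaker
```

  With a tie breaker of 0 a document matching both fields scores the same as 
one matching a single field; any positive tie breaker lifts the two-field 
match, and a larger value lifts it further.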

TASK DETAIL
  https://phabricator.wikimedia.org/T348877

WORKBOARD
  https://phabricator.wikimedia.org/project/board/1227/

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: EBernhardson
Cc: EBernhardson, Gehel, Nikki, Danny_Benjafield_WMDE, Astuthiodit_1, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, 
Gq86, GoranSMilovanovic, Mahir256, QZanden, EBjune, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331

_______________________________________________
Wikidata-bugs mailing list -- [email protected]
To unsubscribe send an email to [email protected]
