dcausse added a comment.

We can inhibit tf/idf by setting the weight of the main query to 0 and using 
either "max" or "add". Note that tf/idf will still play a role in selecting the 
top-N results that will be rescored. N is 8196*7 (7 being the number of shards), 
so if the shards are well balanced we should cover queries that return fewer 
than 57372 entities (with more results there is a risk that the interesting 
entity falls outside the window). We can increase this number, but at some 
performance cost. I'll try to extract (from the cirrus logs) the number of 
queries that return more than this number and see whether we should worry 
about that.
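
To make the arithmetic above concrete, here is a minimal sketch of how the two 
scores could be combined. It is a simplification, not the actual CirrusSearch 
or Elasticsearch code: the function names and the sample scores are made up for 
illustration, and it assumes the rescore score is never negative.

```python
def combine(original, rescore, query_weight, rescore_weight, score_mode):
    """Combine the main-query score and the rescore score (simplified)."""
    weighted_original = query_weight * original
    weighted_rescore = rescore_weight * rescore
    if score_mode == "max":
        return max(weighted_original, weighted_rescore)
    if score_mode == "add":
        return weighted_original + weighted_rescore
    raise ValueError(f"unsupported score_mode: {score_mode}")


def rescored_total(window_per_shard, shards):
    """Upper bound on the number of documents rescored across all shards."""
    return window_per_shard * shards


# Hypothetical scores for one document: a tf/idf score and a rescore score.
tfidf_score, rescore_score = 12.5, 3.0

# With the main query weight at 0, the tf/idf score drops out in both modes.
final_max = combine(tfidf_score, rescore_score, 0.0, 1.0, "max")
final_add = combine(tfidf_score, rescore_score, 0.0, 1.0, "add")

# Window figures taken from the comment above: 8196 per shard, 7 shards.
window = rescored_total(8196, 7)
```

Both modes give the same final score here, which is why either one works once 
the main query weight is 0. The tf/idf score only decides which documents make 
it into the window.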

Adjusting all these weights and finding the proper formula is not an easy task, 
so we should find a way to evaluate the performance. We could maybe take Q1 to 
Q1000, run a query with the English label of each entity, and count the number 
of times the entity is in the top 10. But I don't know the wikidata content 
very well, so there are certainly better tests to run.
We are building a set of tools to run those perf evaluations that could be 
useful in this case.
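
As a rough sketch of the Q1 to Q1000 test, something like the following could 
work. The `search` callable is a placeholder for whatever runs the query and 
returns ranked entity IDs; it is an assumption, not an existing tool, and the 
tiny index below is fake data used only to show the call.

```python
from typing import Callable, Iterable, List, Tuple


def top_k_hit_rate(
    cases: Iterable[Tuple[str, str]],
    search: Callable[[str], List[str]],
    k: int = 10,
) -> float:
    """Fraction of (entity_id, label) cases where the entity is in the top k."""
    cases = list(cases)
    if not cases:
        return 0.0
    hits = 0
    for entity_id, label in cases:
        # search() is expected to return entity IDs ordered by score.
        if entity_id in search(label)[:k]:
            hits += 1
    return hits / len(cases)


# Fake ranked results standing in for a real search backend.
fake_index = {
    "universe": ["Q1", "Q500"],
    "earth": ["Q700", "Q2"],
    "life": ["Q900"],
}


def fake_search(label: str) -> List[str]:
    return fake_index.get(label, [])


cases = [("Q1", "universe"), ("Q2", "earth"), ("Q3", "life")]
rate_top10 = top_k_hit_rate(cases, fake_search, k=10)
rate_top1 = top_k_hit_rate(cases, fake_search, k=1)
```

Running the same cases against two weight configurations and comparing the 
rates would give a first, crude signal of which formula is better.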

Concerning phrases, we have a rescore function with a strong weight, but it 
is applied only to the top 512*7 results because it is very costly. There are 
techniques to optimize this process (word n-grams). If it makes sense for 
wikidata we should probably investigate this direction as well.
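
To illustrate the word n-gram idea: if adjacent words are indexed together as 
single terms, a phrase match can become a plain term lookup instead of a 
position check. The sketch below only shows how such terms are produced. The 
whitespace tokenization is an assumption and much simpler than a real analyzer.

```python
from typing import List


def word_ngrams(text: str, n: int = 2) -> List[str]:
    """Return the word n-grams of a text (naive whitespace tokenization)."""
    tokens = text.lower().split()
    # A text shorter than n words yields no n-grams.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("Douglas Noel Adams")
```

The cost is a bigger index, since every pair of adjacent words becomes an 
extra term.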

Concerning PageRank, I think you're right. @EBernhardson ran a test on enwiki 
and the results are promising. We are building the tools needed to inject such 
data into the indices (hadoop <-> elastic). If the wikidata link graph is easy 
to extract, it should be "easy" to do.
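
For reference, a textbook power-iteration PageRank over a small link graph 
looks roughly like this. It is only an illustration, not the code used for the 
enwiki test; the damping factor, the iteration count and the toy graph are all 
assumptions.

```python
from typing import Dict, List


def pagerank(
    links: Dict[str, List[str]],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Power-iteration PageRank; links maps a node to the nodes it links to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: two entities link to a third one (IDs are placeholders).
toy_links = {"Q1": ["Q2"], "Q3": ["Q2"], "Q2": []}
ranks = pagerank(toy_links)
```

On the toy graph the entity with two incoming links ends up with the highest 
rank, which is the kind of signal we would inject into the indices.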


TASK DETAIL
  https://phabricator.wikimedia.org/T110648

To: dcausse
Cc: Sjoerddebruin, EBernhardson, aude, dcausse, Deskana, daniel, Mbch331, 
Aklapper, Lydia_Pintscher, Wikidata-bugs, Gryllida, jeremyb


