daniel added a comment.

In https://phabricator.wikimedia.org/T119066#1824919, @aude wrote:

> @daniel if you would like "encyclopedia of life" to be the first result for 
> searching "life", then incoming links alone might be good for scoring
>
> life (Q3) has 56 incoming links
>
> encyclopedia of life (Q82486) has 1365362 incoming links


Ah, right... we'd want to consider only links from main snaks, not from 
references (not sure about qualifiers). That would need some work...
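
A rough sketch of what "only links from main snaks" could mean in practice, 
assuming entities are available as dicts in the Wikibase JSON layout 
("claims" -> property -> statements, each with a "mainsnak"). The function 
names are made up for illustration, and the code assumes the item ID is 
present as "id" in the datavalue. Qualifiers and references are simply never 
looked at:

```python
from collections import Counter


def main_snak_targets(entity):
    """Yield entity IDs that this entity links to from main snaks only."""
    for statements in entity.get("claims", {}).values():
        for statement in statements:
            snak = statement.get("mainsnak", {})
            # Skip "somevalue" / "novalue" snaks, which carry no datavalue.
            if snak.get("snaktype") != "value":
                continue
            datavalue = snak.get("datavalue", {})
            if datavalue.get("type") != "wikibase-entityid":
                continue
            # Assumption: the serialization includes the full "id" field.
            target = datavalue.get("value", {}).get("id")
            if target:
                yield target


def count_incoming_main_snak_links(entities):
    """Count, per target ID, how many entities link to it from a main snak."""
    counts = Counter()
    for entity in entities:
        # set() so that an entity linking twice to a target counts once.
        for target in set(main_snak_targets(entity)):
            counts[target] += 1
    return counts
```

With a count like this, an item that is mostly cited in references would no 
longer outrank one that is linked from actual statements.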

> I'm not sure that *not* doing tf/idf is the solution, but we can investigate.

Term frequency doesn't seem to be a good indicator in our use case.

> The way we munge all the different terms in all the languages together in one 
> field is probably not ideal for tf/idf.  "life" is probably translated 
> differently in most languages whereas "Half Life" (Q752241) is generally not 
> translated yet has labels in lots of languages, so "life" is especially 
> frequent.  If we could consider just english when searching in english, then 
> "Half Life" probably is not boosted as much compared to "life".

Yes, this should be per language.

> As well, things like exact title matches don't really work currently for 
> Wikidata. Ideally, we would consider exact label matches in the search 
> language and exact matches would get a boost.

Indeed.

> I think considering other attributes (e.g. # of site links, # of statements, 
> etc) of the document to boost scoring could help. This would not replace 
> considering incoming links but just be additional consideration in scoring. 
> It already works okayish enough in the entity selector. Once we put these in, 
> then we can try different rescorings to see what works well.  If this turns 
> out to be a bad idea, then we can remove the custom rescoring config for 
> wikidata and do as we do now.

Number of sitelinks or statements can help. I'd like to avoid getting too many 
parameters into the mix, though. If we can, let's find one or two indicators 
that work well. If there are too many factors, things tend to become 
unpredictable.
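
To make "one or two indicators" concrete, here is a minimal sketch of a 
rescoring function that uses only the sitelink count and an exact-label-match 
flag. Everything here is hypothetical: the function name, the parameters, the 
log damping and the boost factor are illustrative assumptions, not what the 
search backend does today:

```python
import math


def rescore(text_score, sitelinks, exact_label_match, exact_boost=2.0):
    """Combine a text relevance score with two simple indicators.

    text_score        -- relevance score from the text match (assumed >= 0)
    sitelinks         -- number of sitelinks on the item
    exact_label_match -- True if a label in the search language equals the query
    exact_boost       -- hypothetical multiplier for exact label matches
    """
    # log1p damps the indicator so very large counts do not dominate.
    score = text_score * (1.0 + math.log1p(sitelinks))
    if exact_label_match:
        score *= exact_boost
    return score
```

With only two inputs besides the text score, it stays easy to see why one 
result ranks above another.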

My objection to sitelinks was based on the assumption that we already have 
something better (incoming links), so why invest time in the sitelinks stuff? 
But as you point out, the raw number of incoming links includes links from 
references, and can thus be misleading.


TASK DETAIL
  https://phabricator.wikimedia.org/T119066

To: aude, daniel
Cc: daniel, aude, Aklapper, Wikidata-bugs, Mbch331


