tfmorris added a comment.

  Is triple count the only parameter that matters? Descriptions are likely to 
be longer, on average, than labels, so each description triple probably takes 
more space than a label triple.
  
  It seems odd that there are more descriptions (19% of total) than labels 
(5%), although that agrees with what the previous study found 
<https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Vertical_Analysis#Description>.
The strong spike at 58-61 descriptions per item suggests that some bot 
machine-generated templated descriptions for a large number of languages. 
The fact that Dutch has more descriptions than any other language 
<https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Vertical_Analysis#Language_distribution_of_descriptions>
also says "machine-generated" to me.
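
  For anyone who wants to check the spike themselves, here is a rough sketch of how the share of items in the 58-61 band could be measured. It assumes you already have a mapping from item ID to the language codes of its descriptions; the function names and the sample data are made up for illustration and are not taken from the analysis linked above.

```python
from collections import Counter


def description_count_histogram(descriptions_by_item):
    """Map 'number of descriptions on an item' -> 'number of items with that count'."""
    return Counter(len(langs) for langs in descriptions_by_item.values())


def share_in_range(hist, low, high):
    """Fraction of items whose description count falls in [low, high]."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    in_range = sum(n for count, n in hist.items() if low <= count <= high)
    return in_range / total


# Made-up sample: three items in the 58-61 band, one item with a single description.
sample = {
    "Q1": ["en"],
    "Q2": ["l%d" % i for i in range(58)],
    "Q3": ["l%d" % i for i in range(60)],
    "Q4": ["l%d" % i for i in range(61)],
}
hist = description_count_histogram(sample)
spike_share = share_in_range(hist, 58, 61)
```

A large share concentrated in such a narrow band would be consistent with a bot writing the same set of languages for every item it touched, though it would not prove it.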
  
  Storing machine-generated templated descriptions in the graph seems wasteful. 
I've observed anecdotally when working with person entities that a large number 
of them have pro-forma descriptions of the form <nationality> <occupation> 
(<birth year> - <death year>). These don't need to be stored in the graph 
because they just duplicate existing information. If Wikidata 
search/autocomplete were made smarter, they could be generated on the fly.
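
  To make "generated on the fly" concrete, here is a minimal sketch of building the pro-forma description from fields that are already known about a person. The function names and parameters are hypothetical, it assumes the nationality and occupation labels have already been resolved to strings in the target language, and it is not how Wikidata search currently works.

```python
from typing import Optional


def templated_description(
    nationality: Optional[str],
    occupation: Optional[str],
    birth_year: Optional[int],
    death_year: Optional[int],
) -> Optional[str]:
    """Build '<nationality> <occupation> (<birth year> - <death year>)'.

    Returns None when neither nationality nor occupation is known. The
    lifespan suffix is only added when both years are known (a
    simplification chosen for this sketch).
    """
    head = " ".join(part for part in (nationality, occupation) if part)
    if not head:
        return None
    if birth_year is None or death_year is None:
        return head
    return f"{head} ({birth_year} - {death_year})"


def is_redundant(stored: str, generated: Optional[str]) -> bool:
    """True if a stored description only repeats what the template would produce."""
    if generated is None:
        return False
    return stored.strip().casefold() == generated.casefold()


generated = templated_description("Dutch", "painter", 1606, 1669)
```

A stored description for which `is_redundant` returns True is a candidate for being dropped from the graph and rendered at query time instead.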

TASK DETAIL
  https://phabricator.wikimedia.org/T337021

To: AndrewTavis_WMDE, tfmorris
Cc: tfmorris, Manuel, Aklapper, Lydia_Pintscher, Astuthiodit_1, AWesterinen, 
karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, 
Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, 
EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, 
jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331