tfmorris added a comment.
Is triple count the only important parameter? It seems likely that descriptions are larger, on average, than labels.

It seems odd that there are more descriptions (19% of total) than labels (5%), although that agrees with what the previous study found <https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Vertical_Analysis#Description>. The strong spike at 58-61 descriptions per item tells me that some bot probably machine-generated templated descriptions for a large number of languages. The fact that there are more Dutch descriptions than descriptions in any other language <https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Vertical_Analysis#Language_distribution_of_descriptions> also says "machine-generated" to me.

Storing machine-generated templated descriptions in the graph seems wasteful. I've observed anecdotally when working with person entities that a large number of them have pro-forma descriptions of the form <nationality> <occupation> (<birth year> - <death year>). These obviously don't need to be stored in the graph because they just duplicate existing information. If Wikidata search/autocomplete were made smarter, these could be generated on the fly.

TASK DETAIL
https://phabricator.wikimedia.org/T337021
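
To illustrate the point in the comment about pro-forma person descriptions, here is a minimal Python sketch of generating one on the fly from structured values and checking whether a stored description merely repeats them. The function names (`templated_description`, `is_redundant`) and the plain-string inputs are assumptions made for illustration. This is not Wikidata's search/autocomplete code, and it does not query the graph.

```python
from typing import Optional


def templated_description(
    nationality: str,
    occupation: str,
    birth_year: Optional[int] = None,
    death_year: Optional[int] = None,
) -> str:
    """Build '<nationality> <occupation> (<birth year> - <death year>)'.

    Hypothetical helper: the inputs stand in for values that would be
    read from an item's existing statements.
    """
    base = " ".join(part for part in (nationality, occupation) if part)
    if birth_year is None and death_year is None:
        # No life dates known: omit the parenthesised span entirely.
        return base
    born = "" if birth_year is None else str(birth_year)
    died = "" if death_year is None else str(death_year)
    return f"{base} ({born} - {died})"


def is_redundant(stored: str, generated: str) -> bool:
    """True if the stored description only repeats the generated one.

    Comparison ignores case and differences in whitespace.
    """
    def normalise(text: str) -> str:
        return " ".join(text.lower().split())

    return normalise(stored) == normalise(generated)


generated = templated_description("German", "physicist", 1879, 1955)
print(generated)
print(is_redundant("german physicist  (1879 - 1955)", generated))
```

A check like `is_redundant` could, under these assumptions, be used to estimate how many stored descriptions carry no information beyond what the item's statements already provide.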
