Cheetah90 added a subscriber: Cheetah90.
Cheetah90 added a comment.

Hi Jan and Lydia,
I am one of the PhD students working on this main/sub-article relationship project. The problem we identified is that the assumption that one Wikipedia article corresponds to one concept (which might be equivalent to a Wikidata item?) needs to be revisited for the sake of content completeness. For example, when an AI system uses the content of the "United States" article as its understanding of this concept, it ignores all the content in "History of the United States", "Geography of the United States", etc., which are integral parts of the content. (Yes, the main article "United States" has History and Geography sections, but they are just summaries.)

The main/sub-article relationship problem gets more complicated across the language editions of Wikipedia. For example, if you check "Conspiracy theory <https://en.wikipedia.org/wiki/Conspiracy_theory>", English Wikipedia has only the sub-articles "List of conspiracy theories <https://en.wikipedia.org/wiki/List_of_conspiracy_theories>" and "Conspiracy theories in the Arab world <https://en.wikipedia.org/wiki/Conspiracy_theories_in_the_Arab_world>", but French Wikipedia has other unique sub-articles such as "Théories du complot maçonnique <https://fr.wikipedia.org/wiki/Th%C3%A9ories_du_complot_ma%C3%A7onnique>" and "Théorie du complot juif <https://fr.wikipedia.org/wiki/Th%C3%A9orie_du_complot_juif>", and the Chinese version has the sub-article "SARS阴谋论 <https://zh.wikipedia.org/wiki/SARS%E9%99%B0%E8%AC%80%E8%AB%96>". If we can successfully resolve the main/sub-article relationship, we should be able to benefit from the diverse knowledge in Wikipedia content.

Since the main/sub-article relationship is not well defined (or at least not well understood and applied by Wikipedia editors), there are a lot of false positives and false negatives if we look solely at the {{Main}} template. Currently, we have identified the {{Main}}, {{See also}}, and {{Further}} templates as the sources of candidate sub-articles.
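To make the candidate step concrete, here is a rough sketch of how candidate sub-articles could be pulled out of an article's wikitext. This is only an illustration, not our actual pipeline: the function name candidate_subarticles and the regular expression are made up for this example, and it assumes simple, non-nested template calls and ignores template redirects and aliases.

```python
import re

# Hatnote templates treated as sources of candidate sub-articles.
# Names are compared in lower case, which is an approximation of how
# MediaWiki resolves template names.
CANDIDATE_TEMPLATES = ("main", "see also", "further")

# Matches a simple, non-nested template call such as {{Main|A|B}}.
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*\|([^{}]*)\}\}")


def candidate_subarticles(wikitext):
    """Return (template name, target title) pairs found in the wikitext."""
    candidates = []
    for match in TEMPLATE_RE.finditer(wikitext):
        name = match.group(1).strip().lower()
        if name not in CANDIDATE_TEMPLATES:
            continue
        for param in match.group(2).split("|"):
            param = param.strip()
            # Skip empty parameters and named ones such as l1=Label.
            if not param or "=" in param:
                continue
            # Keep only the page title, dropping any #section anchor.
            title = param.split("#")[0].strip()
            if title:
                candidates.append((name, title))
    return candidates


if __name__ == "__main__":
    sample = (
        "{{Main|History of the United States}}\n"
        "{{See also|Geography of the United States#Climate|l1=Climate}}\n"
        "{{Infobox country|name=United States}}\n"
    )
    for template, title in candidate_subarticles(sample):
        print(template, "->", title)
```

Every pair returned this way is only a candidate. Deciding which ones are true sub-articles is a separate classification step.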
We are extracting some features to train a machine learning algorithm that classifies the candidates as true or false sub-articles. The features we are considering include the semantic relatedness between the main and sub-articles, the PageRankRatio between the main and sub-articles, TokenOverlap, and TokenSynonym/Demonym between the main and sub-articles.

I feel I know too little about Wikidata to suggest direct and meaningful ways to integrate. But as Aaron suggested, we just want to see whether our project can be helpful to Wikidata in some way.

Allen

TASK DETAIL
https://phabricator.wikimedia.org/T117875

To: Lydia_Pintscher, Cheetah90
Cc: Cheetah90, JanZerebecki, Lydia_Pintscher, Aklapper, StudiesWorld, aude, Halfak, Wikidata-bugs, Mbch331
