[Wikidata-bugs] [Maniphest] [Commented On] T117875: Capture article/sub-article relationship in Wikidata items

Cheetah90 Tue, 24 Nov 2015 18:13:03 -0800

Cheetah90 added a subscriber: Cheetah90.
Cheetah90 added a comment.

Hi Jan and Lydia,


I am one of the PhD students working on this main/sub-article relationship
project. The problem we identified is that the assumption that one Wikipedia
article matches one concept (which might be equivalent to a Wikidata item?)
needs to be revisited for the sake of content completeness. For example, when
an AI system uses the content of the "United States" article as its
understanding of this concept, it ignores all the content in "History of the
United States", "Geography of the United States", etc., which are integral
parts of the content. (Yes, the main article "United States" has History and
Geography sections, but they are just summaries.)

The main/sub-article relationship problem gets more complicated across the
language editions of Wikipedia. For example, for "Conspiracy theory
<https://en.wikipedia.org/wiki/Conspiracy_theory>", English Wikipedia has only
the sub-articles "List of conspiracy theories
<https://en.wikipedia.org/wiki/List_of_conspiracy_theories>" and "Conspiracy
theories in the Arab world
<https://en.wikipedia.org/wiki/Conspiracy_theories_in_the_Arab_world>", but
French Wikipedia has other unique sub-articles such as "Théories du
complot maçonnique
<https://fr.wikipedia.org/wiki/Th%C3%A9ories_du_complot_ma%C3%A7onnique>" and
"Théorie du complot juif
<https://fr.wikipedia.org/wiki/Th%C3%A9orie_du_complot_juif>", and Chinese
Wikipedia has the sub-article "SARS阴谋论
<https://zh.wikipedia.org/wiki/SARS%E9%99%B0%E8%AC%80%E8%AB%96>". If we can
successfully resolve the main/sub-article relationship, we should be able to
benefit from the diverse knowledge in Wikipedia content across languages.

Since the main/sub-article relationship is not well defined (or at least not
well understood and applied by Wikipedia editors), there are a lot of false
positives and false negatives if one looks solely at the {Main} template.
Currently, we treat the targets of the {Main}, {See also}, and {Further}
templates as the candidate set of sub-articles. We are extracting features to
train a machine learning classifier that separates true sub-articles from
false ones. The features we are considering include semantic relatedness
between the main and sub-articles, PageRankRatio between the main and
sub-articles, TokenOverlap, and TokenSynonym/Demonym between the main and
sub-articles.
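
As a rough illustration of the candidate step and of one feature, here is a
minimal Python sketch. It is not our actual pipeline: the function names
(candidate_sub_articles, token_overlap), the regular expression, and the exact
definition of token overlap are assumptions made for this example, and the
parsing ignores nested templates and redirects.

```python
import re

# Template names treated as candidate sources, as listed in this message.
CANDIDATE_TEMPLATES = ("main", "see also", "further")

# Matches simple, non-nested templates such as {{Main|History of the United States}}.
_TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*\|([^{}]*)\}\}")


def candidate_sub_articles(wikitext):
    """Return titles named in Main / See also / Further templates, in order."""
    titles = []
    for name, args in _TEMPLATE_RE.findall(wikitext):
        if name.strip().lower() not in CANDIDATE_TEMPLATES:
            continue
        for arg in args.split("|"):
            arg = arg.strip()
            # Skip empty slots and named parameters such as l1=Label.
            if not arg or "=" in arg:
                continue
            if arg not in titles:
                titles.append(arg)
    return titles


def token_overlap(main_title, candidate_title):
    """Fraction of the main title's tokens that also occur in the candidate title.

    Hypothetical definition: the real TokenOverlap feature may be computed differently.
    """
    main_tokens = set(re.findall(r"\w+", main_title.lower()))
    cand_tokens = set(re.findall(r"\w+", candidate_title.lower()))
    if not main_tokens:
        return 0.0
    return len(main_tokens & cand_tokens) / len(main_tokens)
```

Under this definition, token_overlap("United States", "History of the United
States") is 1.0, since both tokens of the main title appear in the candidate
title. A value like this would be one input to the classifier alongside the
other features, not a decision on its own.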

I feel I know too little about Wikidata to suggest direct and meaningful ways
to integrate this. But as Aaron suggested, we just want to see if our project
can be helpful to Wikidata in some way.

Allen


TASK DETAIL
  https://phabricator.wikimedia.org/T117875

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Lydia_Pintscher, Cheetah90
Cc: Cheetah90, JanZerebecki, Lydia_Pintscher, Aklapper, StudiesWorld, aude, 
Halfak, Wikidata-bugs, Mbch331



_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
