Out of idle curiosity ...

Are there significant numbers of articles NOT tagged by any WikiProject? In my 
experience on-wiki, every article (apart from recently created ones) is tagged 
by one or more WikiProjects. 

I guess the converse question is: which articles are tagged by the most 
WikiProjects? I am often surprised at how many WikiProjects jump in to tag an 
article I have created (I am more likely to notice the tagging of articles I 
create because they automatically go on my watchlist).

Kerry

-----Original Message-----
From: Wiki-research-l [mailto:[email protected]] On 
Behalf Of Isaac Johnson
Sent: Thursday, 16 January 2020 6:54 AM
To: Research into Wikimedia content and communities 
<[email protected]>
Subject: [Wiki-research-l] New dataset of articles tagged by WikiProjects

Hey Research Community,
TL;DR New dataset:
https://figshare.com/articles/Wikipedia_Articles_and_Associated_WikiProject_Templates/10248344

More details:

I wanted to notify everyone that we have published a dataset of the articles on 
English Wikipedia that have been tagged by WikiProjects [1] through templates 
on their associated talk pages. We are not planning to make this an ongoing 
release, but I have provided the script that I used to generate it in the 
Figshare item so that others might update / adjust to meet their needs.

As anyone who has done research on WikiProjects knows, it can be complicated to 
determine what articles fit under a particular WikiProject's purview. The 
motivation for generating this dataset was to support our work in developing 
topic models for Wikipedia (see [2] for an overview), but we imagine that there 
are many other ways in which this dataset might be useful:

* Previous work has examined how active WikiProjects are based on edits to 
their pages in the Wikipedia namespace. This dataset makes it much easier to 
identify which WikiProjects are managing the most valuable articles on 
Wikipedia (in terms of quality or pageviews).

* Many topic-level analyses of Wikipedia rely on the category network.
Categories can be very messy and difficult to work with, but WikiProjects 
represent an alternative that often is simpler and still quite rich. For 
instance, this could be used for temporal analyses of article quality, demand, 
or distribution by topic.

* While WikiProjects are English-only and therefore limited in their utility to 
other languages, we also provide the Wikidata ID and sitelinks -- i.e., titles 
for corresponding articles in other languages -- to allow for multilingual 
analyses. This could be used to compare gaps in coverage, akin to past work 
that has used categories [3].

The main challenges, besides processing time, are how to 1) effectively extract 
the WikiProject templates from talk pages, and 2) consistently link them to a 
canonical WikiProject name and topic. For example, the canonical template for 
WikiProject Medicine is 
https://en.wikipedia.org/wiki/Template:WikiProject_Medicine but another one 
used is https://en.wikipedia.org/w/index.php?title=Template:WPMED&redirect=no 
(and there are 13 more). To capture articles tagged with any of these templates 
and link them all to the same canonical WikiProject and eventually higher-level 
topic, we built a near-complete list of WikiProjects based on the WikiProject 
Directory [4] and gathered all of their associated templates. We purposefully 
excluded WikiProjects under the assistance / maintenance category [5]. Then, 
when parsing talk pages from the dump files, we check for any of these 
templates and list them under their canonical name. As a backup, we also employ 
case-insensitive string matching with "WP" and "WikiProject", which helps to 
ensure that we do not miss any WikiProjects but introduces a number of 
false positives as well. If you wish to map the WikiProjects listed in the 
dataset to their higher-level topics, the mapping is in the Figshare item, and 
code that allows you to do that can be found here:
https://github.com/wikimedia/drafttopic/blob/master/drafttopic/utilities/taxo_label.py
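
For readers who want a feel for the matching step described above, here is a 
minimal sketch in Python. It is not the script from the Figshare item. The 
ALIASES table, the regular expression, and the function names are assumptions 
made for illustration; only the two WikiProject Medicine template names are 
taken from this message.

```python
import re

# Hypothetical alias table: normalized template name -> canonical WikiProject.
# A real table would be built from the WikiProject Directory and the
# redirects of each project's template.
ALIASES = {
    "wikiproject medicine": "WikiProject Medicine",
    "wpmed": "WikiProject Medicine",
}

# Captures a template name: the text after "{{" up to the first "|" or "}}".
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?=\||\}\})")


def normalize(name):
    """Lowercase, drop a 'Template:' prefix, treat underscores as spaces."""
    name = name.replace("_", " ").strip()
    if name.lower().startswith("template:"):
        name = name[len("template:"):]
    return " ".join(name.lower().split())


def extract_wikiprojects(talk_wikitext):
    """Return (canonical, fallback) sets of names found in talk page wikitext."""
    canonical, fallback = set(), set()
    for match in TEMPLATE_NAME.finditer(talk_wikitext):
        raw = match.group(1)
        key = normalize(raw)
        if key in ALIASES:
            # Known template or redirect: record the canonical project name.
            canonical.add(ALIASES[key])
        elif key.startswith("wp") or key.startswith("wikiproject"):
            # Backup string match: catches templates missing from the table,
            # but also lets through false positives such as banner shells.
            fallback.add(" ".join(raw.replace("_", " ").split()))
    return canonical, fallback
```

Keeping the fallback matches in a separate set makes it easy to review or 
discard the false positives without touching the canonical matches.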


[1] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council

[2] https://dl.acm.org/doi/10.1145/3274290

[3] https://meta.wikimedia.org/wiki/Research:Newsletter/2019/September#Wikipedia_Topic_Assessment

[4] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory

[5] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory/Wikipedia

Best,
Isaac

--
Isaac Johnson (he/him/his) -- Research Scientist -- Wikimedia Foundation 
_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

