Just a couple of thoughts that cross my mind ...

If people use {{cite book}} and similar templates, it will be relatively easy to 
work out what the components of the citation are. However if people roll their 
own, e.g.

<ref>[http://someurl This And That], Blah Blah 2000</ref>

you may have some difficulty working out what is what. I've just been through a 
tedious exercise of updating a set of URLs using AWB over some thousands of 
articles and some of the ways people roll their own citations were quite 
remarkable (and often quite unhelpful). It may be that you can't extract much 
from such citations. However, the good news is that if they have a URL in them, 
it will probably be in plain sight.
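
To make the distinction concrete, here is a rough Python sketch of how one 
might separate templated citations from hand-rolled ones and pull out any URL 
that is in plain sight. The function name summarise_refs and the three regular 
expressions are my own assumptions for illustration, not anything the project 
has specified, and real wikitext (nested templates, named or self-closing refs) 
would need a proper parser.

```python
import re

# Assumption: refs are simple <ref ...>...</ref> pairs; self-closing
# <ref name="x" /> tags are skipped because they carry no citation body.
REF_RE = re.compile(r"<ref[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
# A URL runs until whitespace or a character that usually ends it in wikitext.
URL_RE = re.compile(r"https?://[^\s\]|}<]+")
# The name of the first template in the ref body, e.g. "cite book".
TEMPLATE_RE = re.compile(r"\{\{\s*([^|}]+?)\s*[|}]")


def summarise_refs(wikitext):
    """Return one dict per <ref>, with its template name (or None) and URLs."""
    results = []
    for body in REF_RE.findall(wikitext):
        template = TEMPLATE_RE.search(body)
        results.append({
            "template": template.group(1) if template else None,
            "urls": URL_RE.findall(body),
        })
    return results


sample = (
    "A.<ref>[http://someurl This And That], Blah Blah 2000</ref> "
    "B.<ref>{{cite QPN|1234}}</ref>"
)
for ref in summarise_refs(sample):
    print(ref)
```

The hand-rolled ref yields a URL but no template name, while the templated one 
yields a template name but no URL, since any URL it produces only exists after 
the template is expanded.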

On the other hand, there are a number of citation templates that I regularly 
use, like {{cite QHR}} (currently 1234 transclusions), {{cite QPN}} (currently 
2738 transclusions) and {{Census 2011 AUS}} (4400 transclusions), all of which 
generate their URLs. I'm not sure how you will deal with these in terms of 
extracting URLs.

But whatever the limitations, it will be a useful dataset to answer some 
interesting questions.

One phenomenon I often see is new users updating information (e.g. changing the 
population of a town) while leaving behind the old citation for the previous 
value. So it superficially looks like the new information is cited to a 
reliable source when in fact it isn't. I've often wished we could automatically 
detect and raise a "warning" when the "text being supported" by the citation 
changes yet the citation does not. The problem, of course, is that we only know 
where the citation appears in the text, and we presume it is in support of 
"some earlier" text (without being clear exactly where that text is). And if an 
article is reorganised, the citation may well "drift away" from the text it 
supports, or even end up supporting text that has since been deleted. So I 
think it is important to know what text preceded the citation at 
the time the citation first appears in the article history as it may be useful 
to compare it against the text that *now* appears before it. It is a great pity 
that (in these digital times) we have not developed a citation model where you 
select chunks of text and link your citation to them, so that the relationship 
between the text and the citation is more apparent.
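
As a minimal sketch of that comparison, assuming the wikitext of two revisions 
is already in hand as plain strings (fetching the article history is out of 
scope here), one could take a fixed window of text before the citation in each 
revision and measure how similar the two windows are. The function names, the 
80-character window and the use of Python's difflib are all my own assumptions 
for illustration.

```python
from difflib import SequenceMatcher


def preceding_text(wikitext, citation, window=80):
    """Return up to `window` characters that appear just before the citation,
    or None if the citation string is not present in this revision."""
    pos = wikitext.find(citation)
    if pos == -1:
        return None
    return wikitext[max(0, pos - window):pos]


def preceding_similarity(old_rev, new_rev, citation, window=80):
    """Similarity (0.0 to 1.0) between the text before the citation in two
    revisions; 1.0 means the preceding text is unchanged. Returns None if
    the citation is missing from either revision."""
    old = preceding_text(old_rev, citation, window)
    new = preceding_text(new_rev, citation, window)
    if old is None or new is None:
        return None
    return SequenceMatcher(None, old, new).ratio()


first = "The town had a population of 1,234.<ref>Census 2011</ref>"
later = "The town had a population of 5,678.<ref>Census 2011</ref>"
score = preceding_similarity(first, later, "<ref>Census 2011</ref>")
if score is not None and score < 1.0:
    print("warning: text before the citation has changed")
```

Any score below 1.0 means the preceding text differs, which is exactly the 
"new population, old citation" case. A fixed window is crude, though: it cannot 
tell which of the preceding sentences the citation was meant to support, which 
is the underlying problem.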

Kerry

-----Original Message-----
From: Wiki-research-l [mailto:[email protected]] On 
Behalf Of Andrea Forte
Sent: Tuesday, 2 May 2017 5:18 AM
To: Research into Wikimedia content and communities 
<[email protected]>
Subject: [Wiki-research-l] Citation Project - Comments Welcome!

Hi all,


One of my PhD students, Meen Chul Kim, is a data scientist with experience in 
bibliometrics and we will be working on some citation-related research together 
with Aaron and Dario in the coming months. Our main goal in the short term is 
to develop an enhanced citation dataset that will allow for future analyses of 
citation data associated with article quality, lifecycle, editing trends, etc.


The project page is here:
https://meta.wikimedia.org/wiki/Research:Understanding_the_context_of_citations_in_Wikipedia


The project is just getting started so this is a great time to offer feedback 
and suggestions, especially for features of citations that we should mine as a 
first step, since this will affect what the dataset can be used for in the 
future.


Looking forward to seeing some of you at WikiCite!!

Andrea




--
 :: Andrea Forte
 :: Associate Professor
 :: College of Computing and Informatics, Drexel University
 :: http://www.andreaforte.net
_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

