The only thing is that the “real life” problem is the text changing while the 
citation stays the same. I don’t see the opposite happen much.

 

Another thought I had was of course to preserve details of the edit which added 
the citation initially: user, timestamp, edit summary, etc.

 

It would be interesting to find “cliques” (in the loose social sense not the 
strict mathematical sense) of users who seem to use the same “clique of 
citations”. Such groups might be sockpuppets, meatpuppets etc. Of course, they 
might just be good faith editors accessing the same very useful resources for 
their favourite topic area.  But I guess if you “smell a rat” with one user or 
one source, then it might be handy to explore any “cliques” they appear to be 
operating within to look for suspicious activity of the others. 
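
One way such “cliques” might be surfaced is by measuring how much the sets of citations used by two users overlap. The following is a minimal sketch only, assuming a citation dataset that can be reduced to a mapping from user name to the set of sources that user has added. The names `citations_by_user` and `similar_user_pairs`, the sample data and the thresholds are all hypothetical. A high score is only a prompt for a human to look closer, since good faith editors in the same topic area will also score highly.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_user_pairs(citations_by_user, threshold=0.5, min_citations=2):
    """Return (user_a, user_b, score) for pairs whose citation sets overlap heavily."""
    pairs = []
    for (ua, ca), (ub, cb) in combinations(sorted(citations_by_user.items()), 2):
        # Skip users with too few citations for the overlap to mean anything.
        if len(ca) < min_citations or len(cb) < min_citations:
            continue
        score = jaccard(ca, cb)
        if score >= threshold:
            pairs.append((ua, ub, score))
    # Highest overlap first.
    return sorted(pairs, key=lambda p: -p[2])

# Hypothetical data: user name -> identifiers of sources that user has cited.
citations_by_user = {
    "UserA": {"src1", "src2", "src3"},
    "UserB": {"src1", "src2", "src3", "src4"},
    "UserC": {"src9"},
    "UserD": {"src2", "src7", "src8"},
}
pairs = similar_user_pairs(citations_by_user)
```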

 

I am not quite sure what we might learn from the edit summaries, but I guess if 
they are not collected, we will never know if they do contain any interesting 
patterns.

 

Another thought that occurs to me is that there is at least one situation when 
some of the text of interest may follow the citation rather than precede it, and 
that is a list. E.g.

 

The presidents of the USA are:<ref> one reliable source about all of the 
presidents</ref>

*        George Washington

*        …

*        Donald Trump

 

Also citations within tables pose a bit of a problem in terms of their “span”. 
Is it just the cell with the citation? Is it more? I see tables with the last 
column being used to hold citations for data that populates that whole row. 

 

Also citations in infoboxes, where there is one field carrying some data 
followed by a corresponding citation field, e.g. pop and pop_footnotes (for 
population in infobox Australian place).
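
For the infobox case, the pairing of a data field with its citation field could perhaps be recovered from the parameter names. This is a rough sketch, assuming the infobox parameters have already been parsed into a dictionary and that citation fields follow the `<field>_footnotes` naming seen in pop and pop_footnotes. The function name `pair_footnote_fields` and the sample values are hypothetical, and other infoboxes may well use different naming conventions.

```python
def pair_footnote_fields(params, suffix="_footnotes"):
    """Map each data field to (value, footnote) where a '<field>_footnotes' sibling exists."""
    pairs = {}
    for name, value in params.items():
        if name.endswith(suffix):
            base = name[: -len(suffix)]
            # Only pair up when the data field itself is present.
            if base in params:
                pairs[base] = (params[base], value)
    return pairs

# Hypothetical, already-parsed infobox parameters.
params = {
    "name": "Sometown",
    "pop": "1234",
    "pop_footnotes": "<ref>Census 2011</ref>",
}
paired = pair_footnote_fields(params)
```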

 

The more I think about this issue, the more I despair. Not so much for this 
project to build a citation database, but rather for the fact that without any 
binding of article text to the citation, the connection between them is likely 
to degrade as successive contributors come along and modify the article, 
particularly so if they cannot access the source. I think we have let ourselves 
be seduced into thinking that so long as we can *see* a lot of inline 
citations [1][2][3] in our article, it is well-sourced. But if we really 
can’t explain what text is supported by which source, is it really 
well-sourced? You might as well just add a bibliography to the end and forget 
in-line citations. Now one might argue this is just as true of a traditional 
journal article (again, no explicit binding of text to source). The 
difference is that a traditional journal article has a single author or a group 
of tightly-coupled authors writing over a relatively short period of time 
(weeks rather than years). They are likely to have shared access to every 
source being cited and are able to confer among themselves if needed to sort 
out any issue relating to citations, so we can expect the citations to remain 
close to the text they support. In Wikipedia, we have a disconnected set of 
authors operating over different time frames over an article lifetime of many 
years, who are unable to share their source materials. So I think the coupling 
between text and citation is inevitably likely to be lost, because we leave no 
trace of the coupling for the next contributor to uphold, even when everyone is 
acting in good faith. Let’s call it “cite rot”, which I’ll define as a loss of 
verifiability due to a disconnect between article text and source.

 

It seems to me that we need to make the connection between text and source more 
explicit. Think of it from a reader's perspective: in most e-readers you can 
select a word or phrase and a dictionary lookup is performed to tell you the 
meaning of the word(s). How about if in the Wikipedia of 2030 (since we are 
discussing movement strategy at the moment), the reader could select some words 
and the sources that support them are returned? E.g. currently we might write

 

Joe Smith was born in London in 1830.[1][2]

 

Where [1] supports that he was born in London and [2] that he was born in 1830.

 

In my 2030 Wikipedia, if we clicked on London, cite [1] would highlight (or 
something) and if we clicked on 1830, [2] would highlight and if we clicked on 
born, both would highlight. That is, the words “Joe Smith was born in London” 
would be tagged as being [1] and “Joe Smith was born … in 1830” would be 
tagged as being [2]. And probably a little pop-up with the exact quote out of 
the source document might appear for your verification pleasure.
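
The tagging described for the Joe Smith sentence can be sketched as a set of character spans, each bound to a citation, with a lookup that returns every citation whose span overlaps the reader's selection. This is only an illustration under stated assumptions: the `CiteSpan` type and the helper names are hypothetical, spans are stored as character offsets into the sentence, and nothing here reflects how MediaWiki stores citations today.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CiteSpan:
    cite_id: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset

def span_of(text, phrase, cite_id):
    """Build a span covering the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return CiteSpan(cite_id, start, start + len(phrase))

def cites_for_selection(spans, sel_start, sel_end):
    """Return sorted cite ids whose spans overlap [sel_start, sel_end)."""
    return sorted({s.cite_id for s in spans
                   if s.start < sel_end and sel_start < s.end})

def cites_for_word(text, spans, word):
    """Simulate the reader clicking on the first occurrence of `word`."""
    start = text.index(word)
    return cites_for_selection(spans, start, start + len(word))

sentence = "Joe Smith was born in London in 1830."
spans = [
    span_of(sentence, "Joe Smith was born in London", "1"),
    # Cite 2 supports two separate pieces of the sentence.
    span_of(sentence, "Joe Smith was born", "2"),
    span_of(sentence, "in 1830", "2"),
]

london = cites_for_word(sentence, spans, "London")  # ['1']
year = cites_for_word(sentence, spans, "1830")      # ['2']
born = cites_for_word(sentence, spans, "born")      # ['1', '2']
```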

 

Now of course we have enough problems with getting our contributors to supply 
any sources, let alone binding them to chunks of text as my proposal would 
entail. But I hear the Movement Strategy conversation is talking about improved 
quality and improved verifiability, so maybe it’s part of the quality 
assessment: if you want a VGA (verifiable good article), the text-to-cite 
mapping must be embedded in the article and almost all of the text must be 
“covered” (in the mathematical sense) by the mapping. Indeed, the extent of 
coverage could be a verifiability metric.
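
As a rough illustration of coverage as a metric, the fraction of an article's characters that fall inside at least one cited span can be computed by merging overlapping spans. This is a sketch under the assumption that the text-to-cite mapping is available as (start, end) character offsets with an exclusive end. The function name `coverage` and the sample numbers are hypothetical.

```python
def coverage(text_length, spans):
    """Fraction of character positions covered by at least one (start, end) span."""
    if text_length == 0:
        return 0.0
    covered = 0
    last_end = 0
    for start, end in sorted(spans):
        # Clip to the text and skip anything already counted by an earlier span.
        start = max(start, last_end, 0)
        end = min(end, text_length)
        if end > start:
            covered += end - start
            last_end = end
    return covered / text_length

# Hypothetical 100-character article with three cited spans, two overlapping
# and one running past the end of the text.
score = coverage(100, [(0, 30), (20, 50), (80, 120)])
```

Here 70 of the 100 characters are covered, so `score` is 0.7. A character-based measure is the simplest choice; counting sentences or words instead would be an equally reasonable variant.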

 

OK, maybe what I am proposing is not the way to go, but I think we ought to be 
thinking about this issue of cite rot, because I think it’s a real problem. I 
suspect it’s already out there but we don’t notice it because we *see* lots of 
inline citations and assume all is well.

 

Kerry

 

From: Andrea Forte [mailto:[email protected]] 
Sent: Wednesday, 3 May 2017 11:46 PM
To: [email protected]
Cc: Research into Wikimedia content and communities 
<[email protected]>
Subject: Re: [Wiki-research-l] Citation Project - Comments Welcome!

 

 

...and YES, detecting when a reference has changed but the adjacent text has 
not is something that will be detectable with the dataset we aim to produce. 
That's a great idea!

 

On Tue, May 2, 2017 at 7:59 AM, Kerry Raymond <[email protected]> wrote:

Just a couple of thoughts that cross my mind ...

If people use the {{cite book}} etc templates, it will be relatively easy to 
work out what the components of the citation are. However if people roll their 
own, e.g.

<ref>[http://someurl This And That], Blah Blah 2000</ref>

you may have some difficulty working out what is what. I've just been through a 
tedious exercise of updating a set of URLs using AWB over some thousands of 
articles and some of the ways people roll their own citations were quite 
remarkable (and often quite unhelpful). It may be that you can't extract much 
from such citations. However, the good news is that if they have a URL in them, 
it will probably be in plain sight.
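
For hand-rolled citations like the one above, a URL that is in plain sight can be pulled out with a pair of regular expressions. This is a minimal sketch, not a full wikitext parser: the patterns `REF_RE` and `URL_RE` and the function `urls_in_refs` are hypothetical, self-closing named references are ignored, and a citation whose URL is generated by a template will simply yield no URL.

```python
import re

# Opening <ref> or <ref name="...">, but not <references> or self-closing <ref ... />.
REF_RE = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
# A bare or bracketed external link; stops at whitespace or wikitext delimiters.
URL_RE = re.compile(r"https?://[^\s\]<|}]+")

def urls_in_refs(wikitext):
    """Return a list of (ref_body, [urls]) for every <ref>...</ref> pair."""
    return [(body.strip(), URL_RE.findall(body))
            for body in REF_RE.findall(wikitext)]

sample = (
    "Some claim.<ref>[http://someurl This And That], Blah Blah 2000</ref> "
    'More text.<ref name="x">{{cite QHR|id=123}}</ref>'
)
result = urls_in_refs(sample)
```

In this sample the first citation gives up its URL, while the template-based one returns an empty list because its URL only exists after the template is expanded.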

By contrast, there are a number of templates that I regularly use for citation, 
like {{cite QHR}} (currently 1234 transclusions), {{cite QPN}} (currently 2738 
transclusions) and {{Census 2011 AUS}} (4400 transclusions), all of which 
generate their URLs. I'm not sure how you will deal with these in terms of 
extracting URLs.

But whatever the limitations, it will be a useful dataset to answer some 
interesting questions.

One phenomenon I often see is new users updating information (e.g. changing the 
population of a town) while leaving behind the old citation for the previous 
value. So it superficially looks like the new information is cited to a 
reliable source when in fact it isn't. I've often wished we could automatically 
detect and raise a "warning" when the "text being supported" by the citation 
changes yet the citation does not. The problem, of course, is that we only know 
where the citation appears in the text and that we presume it is in support of 
"some earlier" text (without being clear exactly where it is). And if an 
article is reorganised, it may well result in the citation "drifting away" from 
the text it supports or even that it is in support of text that has been 
deleted. So I think it is important to know what text preceded the citation at 
the time the citation first appears in the article history as it may be useful 
to compare it against the text that *now* appears before it. It is a great pity 
that (in these digital times) we have not developed a citation model where you 
select chunks of text and link your citation to them, so that the relationship 
between the text and the citation is more apparent.
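
The comparison suggested here, between the text that preceded the citation when it first appeared and the text that precedes it now, can be sketched with the standard library's difflib. This is only an illustration: it assumes both revisions are available as wikitext strings and that the citation markup itself is unchanged between them. The function names, the 200-character window and the sample revisions are hypothetical, and a fixed window is a crude stand-in for "the text being supported".

```python
from difflib import SequenceMatcher

def preceding_text(wikitext, ref_markup, window=200):
    """Return up to `window` characters immediately before the citation, or None."""
    idx = wikitext.find(ref_markup)
    if idx == -1:
        return None
    return wikitext[max(0, idx - window):idx]

def supported_text_similarity(first_rev, current_rev, ref_markup, window=200):
    """Similarity (0.0 to 1.0) of the text preceding the citation in two revisions."""
    before = preceding_text(first_rev, ref_markup, window)
    now = preceding_text(current_rev, ref_markup, window)
    if before is None or now is None:
        return None  # citation is missing from one of the revisions
    return SequenceMatcher(None, before, now).ratio()

# Hypothetical revisions: the population changes but the old citation stays.
ref = "<ref>Census 2011</ref>"
first_rev = "The town had a population of 1,234 at the 2011 census." + ref
current_rev = "The town had a population of 5,678 at the 2011 census." + ref

score = supported_text_similarity(first_rev, current_rev, ref)
# Any drop below 1.0 means the preceding text differs and may deserve a look.
changed = score is not None and score < 1.0
```

A real warning would need a more forgiving rule than "any difference", since copy-edits that do not alter the supported facts would also lower the score.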

Kerry


-----Original Message-----
From: Wiki-research-l [mailto:[email protected]] On Behalf Of Andrea Forte
Sent: Tuesday, 2 May 2017 5:18 AM
To: Research into Wikimedia content and communities <[email protected]>
Subject: [Wiki-research-l] Citation Project - Comments Welcome!

Hi all,


One of my PhD students, Meen Chul Kim, is a data scientist with experience in 
bibliometrics and we will be working on some citation-related research together 
with Aaron and Dario in the coming months. Our main goal in the short term is 
to develop an enhanced citation dataset that will allow for future analyses of 
citation data associated with article quality, lifecycle, editing trends, etc.


The project page is here:
https://meta.wikimedia.org/wiki/Research:Understanding_the_context_of_citations_in_Wikipedia


The project is just getting started so this is a great time to offer feedback 
and suggestions, especially for features of citations that we should mine as a 
first step, since this will affect what the dataset can be used for in the 
future.


Looking forward to seeing some of you at WikiCite!!

Andrea




--
 :: Andrea Forte
 :: Associate Professor
 :: College of Computing and Informatics, Drexel University
 :: http://www.andreaforte.net

_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l





 
