Re: [Wikimedia-l] Quality issues

Andreas Kolbe Sun, 29 Nov 2015 06:10:35 -0800

Gergo,

On Sun, Nov 29, 2015 at 12:36 AM, Gergo Tisza <[email protected]> wrote:

> By the same logic, to the extent Wikipedia takes its facts from non-free
> external source, its free license would be a copyright violation. Luckily
> for us, that's not how copyright works.

I'm aware that facts are not copyrightable. By the same logic, Wikidata
being offered under a CC BY-SA license, say, would not prevent anyone from
extracting facts -- knowledge -- from it, and it would enable Wikidata to
import a lot of data it presently cannot, because of licence
incompatibilities.

> Statements of facts can not be
> copyrighted; large-scale arrangements of facts (ie. a full database)
> probably can, but CC does not prevent others from using them without
> attribution, just distributing them (again, it's like the GPL/Affero
> difference);

Distribution is the issue here – large-scale distribution and viral
propagation of data with a well-documented potential for manipulation and
error, in a way that makes the provenance of these data a closed book to
the end user.

Do you accept that this is a potential problem, and if so, how would you
guard against it, if not through the licence?

> there are sui generis database rights in some countries but
> not in the USA where both Wikipedia and most proprietary
> reusers/competitors are located, so relying on neighbouring rights would
> not help there but cause legal uncertainty for reusers (e.g. OSM which has
> lots of legal trouble importing coordinates due to being EU-based).
>

It seems noteworthy that Freebase specifically said, with regard to loading
structured data, "If a data source is under CC-BY, you can load it into
Freebase as long as you provide attribution."[1]

Wikidata practice seems to have taken a different path regarding licence
compatibility, given its systematic imports from Wikipedia.

Interestingly enough, it's been pointed out to me that Denny said in
2012,[2]

---o0o---

Alexrk2, it is true that Wikidata under CC0 would not be allowed to import
content from a Share-Alike data source. Wikidata does not plan to extract
content out of Wikipedia at all. Wikidata will *provide* data that can be
reused in the Wikipedias. And a CC0 source can be used by a Share-Alike
project, be it either Wikipedia or OSM. But not the other way around. Do we
agree on this understanding? --Denny Vrandečić (WMDE)
<https://meta.wikimedia.org/wiki/User:Denny_Vrande%C4%8Di%C4%87_(WMDE)>
(talk <https://meta.wikimedia.org/wiki/User_talk:Denny_Vrande%C4%8Di%C4%87_(WMDE)>)
12:39, 4 July 2012 (UTC)

---o0o---

The key sentence here is "Wikidata does not plan to extract content out of
Wikipedia at all."

That doesn't seem to be how things have turned out, because today we have
people on Wikidata raising alarms about mass imports from Wikipedia:[3]

---o0o---

Reliable Bot imports from wikipedias?

In a Wikipedia discussion I came by chance across a link to the following
discussion:

   - Wikidata:Project_chat/Archive/2015/10#STOP_with_bot_import
     <https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2015/10#STOP_with_bot_import>

[...] To provide an outside perspective as Wikipedian (and a potential
use[r] of WD in the future). I wholeheartedly agree with Snipre, in fact
"bots [ar]e running wild" and the uncontrolled import of data/information
from Wikipedias is one of the main reasons for some Wikipedias developing
an increasingly hostile attitude towards WD and its usage in Wikipedias.
*If* WD is ever to function as a central data storage for various Wikimedia
projects and in particular Wikipedia as well (in analogy to Commons), *then*
quality has to take the driver's seat over quantity. A central storage
needs a much better data integrity than the projects using it, because one
mistake in its data will multiply throughout the projects relying on WD,
which may cause all sorts of problems. For crude comparison think of a
virus placed on a central server than on a single client. The consequences
are much more severe and nobody in their right mind would run the server
with even less protection/restrictions than the client.

Another thin[g] is, that if you envision users of other Wikimedia projects
such as Wikipedia or even 3rd party external projects to eventually help
with data maintenance when they start using WD, then you might find them
rather unwilling to do so, if not enough attention is paid to quality,
instead they probably just dump WD from their projects.

In general all the advantages of the central data storage depend on the
quality (reliability) of data. If that is not given to reasonable high
degree, there is no point to have central data storage at all. All the
great application become useless if they operate on false data.--Kmhkmh
<https://www.wikidata.org/w/index.php?title=User:Kmhkmh&action=edit&redlink=1>
 (talk <https://www.wikidata.org/wiki/User_talk:Kmhkmh>) 12:00, 19 November
2015 (UTC)

---o0o---

(I was unaware of that post by Kmhkmh when I started contributing to this
discussion, but it obviously echoes some of my own concerns.)

I've been told on the German Wikipedia that the Wikidata CC0 licence has
long been a controversial issue, subject to recurrent discussion,
especially with regard to official population statistics in Europe, whose
publishers often require attribution, making their wholesale import in
Wikidata's CC0 environment problematic.[4]

Reading these discussions, I found that some contributors' lines of thought
reminded me of Flickrwashing schemes: how -- via which intermediary
steps -- can we get the info into our CC0 project without being seen to
fall foul of the original publishers' licences?

As I understand it, the intent is to bully other data publishers into
making their data available under CC0 as well. I understand this from an
open-content perspective, and I can see how it might benefit Google's and
other information platforms' bottom line, but I reiterate -- there are
very, very significant downsides to having a central database that is
subject to anonymous manipulation by all comers and whose data is
automatically propagated by major search engines. There are many
autocratic regimes in the world
today who spend a lot of money and effort to achieve this kind of uniform
media response in their countries.

In my opinion, it creates a significant vulnerability in the global
information infrastructure. If, in more troubled times ahead, people are
fed the same unattributed lie by all major online outlets, because they are
all automatically propagating the content of Wikimedia's CC0 database, then
this could potentially alter the course of history, and not in a good way.

I am happy to hear ideas about how to address this that do not involve
licensing. We need more transparency about data provenance.

You may argue that Wikidata is still in its early days, and has nowhere
near the amount of data, nowhere near the reach and impact today to justify
such an effort. Maybe it never will, and I'm worrying for nothing.

But we thought much the same about Wikipedia around the time of the
Seigenthaler incident. Before we knew it, Wikipedia had become the world's
dominant information resource, with increasing numbers of government
officials, judges, journalists and academics happy to accept its word
uncritically – in a way that horrifies most Wikipedians, who are well aware
of the system's weaknesses.

Last month, for example, the Wikipedian in Residence at NIOSH (National
Institute for Occupational Safety and Health) said on Wikidata that he
would "cringe" at the thought of using Wikipedia as a source and personally
refrained from it:[5]

---o0o---

   - As a note, I do semi-automated edits on my work account
   <https://www.wikidata.org/wiki/User:James_Hare_(NIOSH)>, and I plan on
   doing some as a volunteer as well. I don't use Wikipedia as a source (as
   a Wikipedian of 11 years, I cringe at the thought ;), but if any batch
   edits I do manage to screw something up despite my meticulous planning,
   please let me know immediately. I will take responsibility for my own
   messes. Harej <https://www.wikidata.org/wiki/User:Harej> (talk
   <https://www.wikidata.org/wiki/User_talk:Harej>) 17:38, 27 October 2015
   (UTC)

---o0o---

If Wikidata were to acquire the global reach its makers and sponsors hope
for, then we would have done well to build a robust system that minimises
harm, and cannot become a victim of its own success. I propose that there
is work to be done here.

Coming back briefly to the legal licensing situation, it seems to be fairly
complex even in the US, according to the relevant Wikilegal page on
Meta[6], with much depending on the amount of material extracted, as you
pointed out above.

Things are more complicated still in the EU, given that European law
protects databases created by EU citizens or residents (which includes a
good number of Wikimedians), with that protection extending to "sweat of
the brow" (unprotected in the US). EU law even prohibits the "repeated and
systematic extraction" of "insubstantial parts of the contents" of a
database (where the term "database" is defined broadly enough to include a
Wikipedia).

There's not much point in my saying more about the legal aspects of
licensing; even the advice from the Foundation's legal professionals says
it's rarely easy to predict how a court might rule under either EU or US
law.[6]

Andreas

[1] http://wiki.freebase.com/wiki/License_compatibility
[2]
https://meta.wikimedia.org/wiki/Talk:Wikidata#Is_CC_the_right_license_for_data.3F
[3]
https://www.wikidata.org/wiki/Wikidata:Project_chat#Reliable_Bot_imports_from_wikipedias.3F
[4]
https://www.mail-archive.com/[email protected]/msg00178.html
https://www.wikidata.org/w/index.php?title=Wikidata:General_disclaimer&diff=271182&oldid=270466
http://osdir.com/ml/general/2012-11/msg31088.html
http://www.mail-archive.com/[email protected]/msg03088.html
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2014/04#Modifying_license_.3F
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2014/04#Data_release_email_templates
https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team/Archive/2014/04#License
http://www.gossamer-threads.com/lists/wiki/foundation/450291#450291
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2014/05#Population_statistics.3F
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2014/04#Data_owners

[5]
https://www.wikidata.org/w/index.php?title=Wikidata:Project_chat&diff=prev&oldid=263358509

[6] https://meta.wikimedia.org/wiki/Wikilegal/Database_Rights