Re: [Wikimedia-l] Quality issues

Andreas Kolbe Mon, 07 Dec 2015 15:02:59 -0800

Hi Markus,

On 1 December 2015 at 23:43, Markus Krötzsch <markus at
semantic-mediawiki.org> wrote:

> [I continue cross-posting for this reply, but it would make sense to
> return the thread to the Wikidata list where it started, so as to avoid
> partial discussions happening in many places.]

Apologies for the late reply.

While you indicated that you had crossposted this reply to
Wikimedia-l, it didn't turn up in my inbox. I only saw it today, after
Atlasowa pointed it out on the Signpost op-ed's talk page.[1]

> On 27.11.2015 12:08, Andreas Kolbe wrote:

> > Wikipedia content is considered a reliable source in Wikidata, and
> > Wikidata content is used as a reliable source by Google, where it
> > appears without any indication of its provenance.

> This prompted me to reply. I wanted to write an email that merely says:
> "Really? Where did you get this from?" (Google using Wikidata content)

Multiple sources, including what appears to be your own research
group's writing:[2]

---o0o---

In December 2013, Google announced that their own collaboratively
edited knowledge base, Freebase, is to be discontinued in favour of
Wikidata, which gives Wikidata a prominent role as an in[p]ut for
Google Knowledge Graph. The research group Knowledge Systems
<https://ddll.inf.tu-dresden.de/web/Knowledge_Systems/en> is working
in close cooperation with the development team behind Wikidata, and
provides, e.g., the regular Wikidata RDF-Exports.

---o0o---

> But then I read the rest ... so here you go ...

> Your email mixes up many things and effects, some of which are important
> issues (e.g., the fact that VIAF is not a primary data source that
> should be used in citations). Many other of your remarks I find very
> hard to take serious, including but not limited to the following:

> * A rather bizarre connection between licensing models and
> accountability (as if it would make content more credible if you are
> legally required to say that you found it on Wikipedia, or even give a
> list of user names and IPs who contributed)

Both Freebase and Wikipedia have attribution licences. When Bing's
Snapshot displays information drawn from Freebase or Wikipedia, it's
indicated thus at the bottom of the infobox[3]:

---o0o---

Data from Freebase · Wikipedia

---o0o---

I take this as a token gesture to these sources' attribution licences.

Given the amount of space they have available, I would think most
people would agree that this form of attribution is sufficient. You
couldn't possibly expect them to list all contributors who have ever
contributed to the lead of the Wikipedia article, for example, as the
letter of the licence might require.

However, I think it's proper and important that those minimal
attributions are there. And given Wikidata's CC0 licence, I don't
expect re-users to continue attributing in this manner. This view is
shared by Max Klein for example, who is quoted to that effect in the
Signpost op-ed.[4]

> * Some stories that I think you really just made up for the sake of
> argument (Denny alone has picked the Wikidata license?

Denny led the development team. There are multiple public instances
and accounts of his having advocated this choice and convinced people
of the wisdom of it, in Wikidata talk pages and elsewhere, including a
recent post on the Wikidata mailing list.[5]

Interestingly, he originally said that this would mean there could be
no imports from Wikipedia, and that there was in fact no intention to
import data from Wikipedias (see op-ed).[6] He also said, higher up on
that page, that this was "for starters", and that that decision could
easily be changed later on by the community.[7]

> Google displays Wikidata content?

See above. If Wikidata plays "a prominent role as an in[p]ut for
Google Knowledge Graph" then I would expect there to be
correspondences between Knowledge Graph and Wikidata content.

> Bing is fuelled by Wikimedia?)

I spoke of "Wikimedia-fuelled search engines like Google and Bing" in
the context of the Google Knowledge Graph and Bing's Snapshot/Satori
equivalent.

We all know that in both cases, much of the content Google and Bing
display in these infoboxes comes from Wikimedia projects (Wikipedia,
Commons and now, apparently, Wikidata).

> * Some disjointed remarks about the history of capitalism
> * The assertion that content is worse just because the author who
> created it used a bot for editing

I spoke of "bot users mass-importing unreliable data". It's not the
bot method that makes the data unreliable: they are unreliable to
begin with (because they are unsourced, nobody verifies the source,
etc.).

As I pointed out in this week's op-ed, of the top fifteen hoaxes in
the English Wikipedia, six have active Wikidata items (or rather, had:
they were deleted this morning, after the op-ed appeared).

This is what I mean by unreliable data.

> * The idea that engineers want to build systems with bad data because
> they like the challenge of cleaning it up -- I mean: really! There is
> nothing one can even say to this.

Again, this is not quite what I was trying to convey. My impression is
that the current community effort at Wikidata emphasises speed: hence
the mass imports of data from Wikipedia, whether verifiable or not,
contrary to original intentions, as represented by Denny's quote
above.

As far as I can make out, present-day thinking among many Wikidatans
is: let's get lots of data in fast even though we know some of it will
be bad. Afterwards, we can then apply clever methods to check for
inconsistencies and clean our data up -- which is a challenge people
do seem to warm to. Meanwhile, others throw up their arms in dismay
and say, "Stop! You're importing bad data."

Wouldn't you agree that this characterises some of the recent
discussions on the Wikidata Project Chat page?

The two camps seem approximately evenly represented in the discussions
I've seen. But while the one camp says "Stop!", the other camp
continues importing. So in practice, the importers are getting their
way.

> * The complaint that Wikimedia employs too much engineering expertise
> and too little content expertise (when, in reality, it is a key
> principle of Wikimedia to keep out of content, and communities regularly
> complain WMF would still meddle too much).

Is it not obvious that I was talking about community practices rather
than the actions of Wikimedia staff?

> * All those convincing arguments you make against open, anonymous
> editing because of it being easy to manipulate (I've heard this from
> Wikipedia critics ten years ago; wonder what became of them)

Such criticisms are still regularly levelled at Wikipedia, in
top-quality publications. If you really want, I can send you a
literature list, but you could begin with this article in Newsweek.[8]

> * And, finally, the culminating conspiracy theory of total control over
> political opinion, destroying all plurality by allowing only one
> viewpoint (not exactly what I observe on the Web ...) -- and topping
> this by blaming it all on the choice of a particular Creative Commons
> license for Wikidata! Really, you can't make this up.

The information provided by default to billions of search engine users
*matters*. You can never prevent an individual from going to a website
that espouses a different view, but you don't have to for that
information to have a measurable effect.

Robert Epstein and Ronald E. Robertson recently published a paper on
what they called "The search engine manipulation effect (SEME) and its
possible impact on the outcomes of elections".[9] It provides further
detail.

> Summing up: either this is an elaborate satire that tries to test how
> serious an answer you will get on a Wikimedia list, or you should
> *seriously* rethink what you wrote here, take back the things that are
> obviously bogus, and have a down-to-earth discussion about the topics
> you really care about (licenses and cyclic sourcing on Wikimedia
> projects, I guess; "capitalist companies controlling public media"
> should be discussed in another forum).

No satire was intended. I hope I have succeeded in making my points clearer.

Regards,

Andreas

[1]
https://en.wikipedia.org/wiki/Wikipedia_talk:Wikipedia_Signpost/2015-12-02/Op-ed
[2] https://ddll.inf.tu-dresden.de/web/Wikidata/en
[3]
http://www.bing.com/search?q=jerusalem&go=Submit&qs=n&form=QBLH&pq=jerusalem&sc=9-9&sp=-1&sk=&cvid=62C12B6CC7B94CD1A9081E17AC205270
[4]
https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-12-02/Op-ed
[5] https://lists.wikimedia.org/pipermail/wikidata/2015-December/007769.html
[6] https://archive.is/ZbV5A#selection-2997.0-3009.26
[7] https://archive.is/ZbV5A#selection-2755.308-2763.27
[8]
http://www.newsweek.com/2015/04/03/manipulating-wikipedia-promote-bogus-business-school-316133.html
[9] http://www.pnas.org/content/112/33/E4512.abstract