On Sat, Oct 7, 2017 at 1:00 PM, Andreas Kolbe <jayen...@gmail.com> wrote:

> ... and it will all become one free mush everyone copies to make a buck. We
> are already in a situation today where anyone asking Siri, the Amazon Echo,
> Google or Bing about a topic is likely to get the same answer from all of
> them, because they all import Wikimedia content, which comes free of
> charge.

I wouldn't call information from Wikimedia projects a "mush", but I
think it's a good term for the proprietary amalgamation of information
and data from many sources, often without any regard for the
reliability of the source. Google is the king of such gooey
amalgamation. Its home assistant has been known to give answers like
this, sourced to "secretsofthefed.com":

     "According to details exposed in Western Center for Journalism's
      exclusive video, not only could Obama be in bed with the
      communist Chinese, but Obama may in fact be planning a
      communist coup d'état at the end of his term in 2016."

See, e.g., this article


for other egregious examples specifically from Google's featured responses.

It's certainly true that Wikipedia is an easy target for ingestion,
not just because of its copyright status, but also because it is
comprehensive, multilingual, unrestricted (as in, not behind a paywall
or rate limit), and even fully available for download. But copyright
status is not really a major barrier once you are talking about fact
extraction and "fair use" snippets.

For Google, I suggest a query like "when was slavery abolished?"
followed by exploring the auto-suggested questions. In my case, the
first 10 questions point to snippets from:

- pbs.org (twice)
- USA Today
- Reuters
- archives.gov
- Wikipedia (twice)
- infoplease.com
- ourdocuments.gov
- nationalarchives.gov.uk

Even for its fact boxes, where Wikipedia excerpts often feature
prominently, Google does not exclusively rely on it; the tabular data
contains information not found in any Wikimedia project. Even the
textual blurbs often come from sources of unclear provenance; for
example, country blurb text (try googling "France" or "Russia") is not
from WP.

This amalgamation will get ever more sophisticated and more
proprietary (specific to each of these corporations) as AI improves.
That's because it lets companies pry apart "facts" and "expression":
the former are uncopyrightable. As AI systems get better at
understanding text, more information can be summarized and presented without
even invoking "fair use", much in the same way as Wikipedia itself
summarizes sources.

It's the universe of linked open data (Wikipedia/Wikidata,
OpenStreetMap, and other open datasets) that keeps the space at least
somewhat competitive, by giving players without much of a foothold a
starting point from which to build. If Wikimedia did not exist, a
smaller number of commercial players would wield greater power, due to
the higher relative payoff of large investments in data mining and AI.

> I find that worrying, because as an information delivery system,
> it’s not robust. You change one source, and all the other sources
> change as well.

As noted above, this is not actually what is happening. Commercial
players don't want to limit themselves to free/open data; they want to
use AI to extract as much information about the world as possible so
they can answer as many queries as possible.

And for most of the sources amalgamated in this manner, if provenance
is indicated at all, we don't find any of the safeguards we have for
Wikimedia content (revisioning, participatory decision-making,
transparent policies, etc.). Editability, while opening the floodgates
to a category of problems other sources don't have, is in fact also a
safeguard: it makes it possible to fix mistakes directly instead of going
through a "feedback" form that ends up who knows where.

With an eye to 2030 and WMF's long-term direction, I do think it's
worth thinking about Wikidata's centrality, and I would agree with you
at least that the phrase "the essential infrastructure of the
ecosystem" does overstate what I think WMF should aspire to (the
"essential infrastructure" should consist of many open components
maintained by different groups). But beyond that I think you're
reading stuff into the statement that isn't there.

Wikidata in particular is best seen not as the singular source of
truth, but as an important hub in a network of open data providers --
primarily governments, public institutions, nonprofits. This is
consistent with recent developments around Wikidata such as query

Wikidata will often provide a shallow first level of information about
a subject, while other linked sources provide deeper information. The
more structured the information, the easier it becomes to validate in
an automatic fashion that, for example, the subset of country
population time series data represented in Wikidata is an accurate
representation of the source material. Even when a large source
dataset is mirrored by Wikimedia (for low-latency visualization, say),
you can hash it, digitally sign it, and restrict modifiability of

If we expose the history, provenance and structure of information, and
the connections between sources, we can actually make the information
more resilient against manipulation than if it is merely a piece of
text in an article, some number in an {{infobox}} template or some
"factoid" in a proprietary knowledge graph.
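
To make the validation and hashing idea above a bit more concrete, here
is a minimal sketch in Python. Everything in it is an assumption for
illustration: the (country, year, population) row layout, the function
names, and the placeholder figures are made up and do not describe how
Wikidata or any real pipeline works. It only covers the hashing and
subset comparison; digital signing would need a key infrastructure on
top and is not shown.

```python
import hashlib
import json


def dataset_digest(rows):
    """Return a SHA-256 hex digest over a canonical serialization of rows.

    Rows are sorted first so the digest does not depend on row order.
    """
    canonical = json.dumps(sorted(rows), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def mismatches(subset, source):
    """Return the rows of subset that do not appear identically in source."""
    source_rows = set(source)
    return [row for row in subset if row not in source_rows]


# Hypothetical placeholder rows (country, year, population); not real data.
source = [
    ("Exampleland", 2000, 1000),
    ("Exampleland", 2001, 1010),
    ("Exampleland", 2002, 1025),
]

# A mirror that carries only part of the source series.
subset = [
    ("Exampleland", 2000, 1000),
    ("Exampleland", 2002, 1025),
]

# Digest published alongside the source; anyone can recompute and compare.
published_digest = dataset_digest(source)

print("rows not backed by source:", mismatches(subset, source))
print("source digest matches:", dataset_digest(source) == published_digest)
```

If someone changed a single figure in the mirrored copy, the recomputed
digest would no longer match the published one, and mismatches() would
list the rows that no longer agree with the source.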

> is it just that some of the world's most profitable companies earn billions
> from volunteers' work, gaining political power in the process, while
> volunteers actually pay to go online and access or purchase the sources
> they need to do their work? Yes or no?

I don't accept your framing. Search the way it used to be (with
algorithms primarily tuned for relevance of results) was a fair deal
for everyone involved: you put stuff on the web, it gets indexed and
people are able to find it; the search engines make money by putting
ads on the search result page. The amalgamation of information into
knowledge graphs that deliver concise answers directly (however
inadequate) changes the dynamic significantly.

It accords ever greater power to the maintainers of these proprietary
graphs which, I hasten to repeat, incorporate information well beyond
just Wikimedia's, and which frequently fail to indicate provenance in
an adequate manner. And, as the example at the beginning of this
message shows, it leads to "information pollution", with fake news,
conspiracy theories and pseudoscience leaking into semi-authoritative
instant answers.

I don't think the social justice problem here is that these companies
make a profit, but that they function more and more as gatekeepers and
curators of knowledge, a role for which they're ill-equipped and which
civil society should be reluctant to give them.

But the proprietary knowledge graphs are valuable to users in ways
that the previous generation of search engines was not. Interacting
with a device like you would with a human being ("Alexa/Google/Siri,
is yarrow edible?") makes knowledge more accessible and usable,
including to people who have difficulty reading long texts, or who are
not literate at all. In this sense I don't think WMF should ever find
itself in the position to argue _against_ inclusion of information
from Wikimedia projects in these applications.

The applications themselves are not the problem; the centralized
gatekeeper control is. Knowledge as an open service (and network) is
actually the solution to that root problem. It's how we weaken and
perhaps even break the control of the gatekeepers. Your critique seems
to boil down to "Let's ask Google for more crumbs". In spite of all
your anti-corporate social justice rhetoric, that seems to be the path
to developing a one-sided dependency relationship.

To be clear, I'm in favor of corporations giving more to the commons,
though in my ideal world, that would happen through aggressive
taxation and greater public investment (especially in schools,
universities and GLAMs). I have every confidence that WMF does in fact
ask for as much as it can be expected to in conversations with
corporations, but it's not clear what you're suggesting should happen
if the corporations say no.


Wikimedia-l mailing list, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
New messages to: Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, 
