On Sat, Oct 7, 2017 at 1:00 PM, Andreas Kolbe <jayen...@gmail.com> wrote:
> ... and it will all become one free mush everyone copies to make a buck. We
> are already in a situation today where anyone asking Siri, the Amazon Echo,
> Google or Bing about a topic is likely to get the same answer from all of
> them, because they all import Wikimedia content, which comes free of
> charge.

I wouldn't call information from Wikimedia projects a "mush", but I think it's a good term for the proprietary amalgamation of information and data from many sources, often without any regard for the reliability of the source. Google is the king of such gooey amalgamation. Its home assistant has been known to give answers like this, sourced to "secretsofthefed.com":

"According to details exposed in Western Center for Journalism's exclusive video, not only could Obama be in bed with the communist Chinese, but Obama may in fact be planning a communist coup d'état at the end of his term in 2016."

See, e.g., this article
https://theoutline.com/post/1192/google-s-featured-snippets-are-worse-than-fake-news
for other egregious examples specifically from Google's featured responses.

It's certainly true that Wikipedia is an easy target for ingestion, not just because of its copyright status, but also because it is comprehensive, multilingual, unrestricted (as in, not behind a paywall or rate limit), and even fully available for download. But copyright status is not really a major barrier once you are talking about fact extraction and "fair use" snippets.

For Google, I suggest a query like "when was slavery abolished?" followed by exploring the auto-suggested questions. In my case, the first 10 questions point to snippets from:

- pbs.org (twice)
- USA Today
- Reuters
- archives.gov
- Wikipedia (twice)
- infoplease.com
- ourdocuments.gov
- nationalarchives.gov.uk

Even for its fact boxes, where Wikipedia excerpts often feature prominently, Google does not exclusively rely on it; the tabular data contains information not found in any Wikimedia project.
Even the textual blurbs often come from sources of unclear provenance; for example, country blurb text (try googling "France" or "Russia") is not from WP.

This amalgamation will get ever more sophisticated and more proprietary (specific to each of these corporations) as AI improves. That's because it lets companies pry apart "facts" and "expression": the former are uncopyrightable. As the textual understanding of AI systems improves, more information can be summarized and presented without even invoking "fair use", much in the same way as Wikipedia itself summarizes sources.

It's the universe of linked open data (Wikipedia/Wikidata, OpenStreetMap, and other open datasets) that keeps the space at least somewhat competitive, by giving players without much of a foothold a starting point from which to build. If Wikimedia did not exist, a smaller number of commercial players would wield greater power, due to the higher relative payoff of large investments in data mining and AI.

> I find that worrying, because as an information delivery system,
> it’s not robust. You change one source, and all the other sources
> change as well.

As noted above, this is not actually what is happening. Commercial players don't want to limit themselves to free/open data; they want to use AI to extract as much information about the world as possible so they can answer as many queries as possible.

And for most of the sources amalgamated in this manner, if provenance is indicated at all, we don't find any of the safeguards we have for Wikimedia content (revisioning, participatory decision-making, transparent policies, etc.). Editability, while opening the floodgates to a category of problems other sources don't have, is in fact also a safeguard: it makes it possible to fix mistakes directly instead of going through a "feedback" form that ends up who knows where.
With an eye to 2030 and WMF's long-term direction, I do think it's worth thinking about Wikidata's centrality, and I would agree with you at least that the phrase "the essential infrastructure of the ecosystem" does overstate what I think WMF should aspire to (the "essential infrastructure" should consist of many open components maintained by different groups).

But beyond that I think you're reading stuff into the statement that isn't there. Wikidata in particular is best seen not as the singular source of truth, but as an important hub in a network of open data providers -- primarily governments, public institutions, nonprofits. This is consistent with recent developments around Wikidata such as query federation.

Wikidata will often provide a shallow first level of information about a subject, while other linked sources provide deeper information. The more structured the information, the easier it becomes to validate in an automatic fashion that, for example, the subset of country population time series data represented in Wikidata is an accurate representation of the source material. Even when a large source dataset is mirrored by Wikimedia (for low-latency visualization, say), you can hash it, digitally sign it, and restrict modifiability of copies.

If we expose the history, provenance and structure of information, and the connections between sources, we can actually make the information more resilient against manipulation than if it is merely a piece of text in an article, some number in an {{infobox}} template or some "factoid" in a proprietary knowledge graph.

> is it just that some of the world's most profitable companies earn billions
> from volunteers' work, gaining political power in the process, while
> volunteers actually pay to go online and access or purchase the sources
> they need to do their work? Yes or no?

I don't accept your framing.
Search the way it used to be (with algorithms primarily tuned for relevance of results) was a fair deal for everyone involved: you put stuff on the web, it gets indexed and people are able to find it; the search engines make money by putting ads on the search result page.

The amalgamation of information into knowledge graphs that deliver concise answers directly (however inadequate) changes the dynamic significantly. It accords ever greater power to the maintainers of these proprietary graphs which, I hasten to repeat, incorporate information well beyond just Wikimedia's, and which frequently fail to indicate provenance in an adequate manner. And, as the example at the beginning of this message shows, it leads to "information pollution", with fake news, conspiracy theories and pseudoscience leaking into semi-authoritative instant answers.

I don't think the social justice problem here is that these companies make a profit, but that they function more and more as gatekeepers and curators of knowledge, a role for which they're ill-equipped and which civil society should be reluctant to give them.

But the proprietary knowledge graphs are valuable to users in ways that the previous generation of search engines was not. Interacting with a device like you would with a human being ("Alexa/Google/Siri, is yarrow edible?") makes knowledge more accessible and usable, including to people who have difficulty reading long texts, or who are not literate at all. In this sense I don't think WMF should ever find itself in the position to argue _against_ inclusion of information from Wikimedia projects in these applications.

The applications themselves are not the problem; the centralized gatekeeper control is. Knowledge as an open service (and network) is actually the solution to that root problem. It's how we weaken and perhaps even break the control of the gatekeepers.

Your critique seems to boil down to "Let's ask Google for more crumbs".
In spite of all your anti-corporate social justice rhetoric, that seems to be the path to developing a one-sided dependency relationship.

To be clear, I'm in favor of corporations giving more to the commons, though in my ideal world, that would happen through aggressive taxation and greater public investment (especially in schools, universities and GLAMs). I have every confidence that WMF does in fact ask for as much as it can be expected to in conversations with corporations, but it's not clear what you're suggesting should happen if the corporations say no.

Erik

_______________________________________________
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l
New messages to: Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, <mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe>