Re: [Wikimedia-l] Quality issues

Gerard Meijssen Fri, 27 Nov 2015 11:27:06 -0800

Hoi,

I happen to be working on the Dukes of Friuli. Compare the data shown by
Wikidata with the information Reasonator presents for the same item, for
one of them:


https://tools.wmflabs.org/reasonator/?&q=2471519
https://www.wikidata.org/wiki/Q2471519

Wikidata is not informative; you have to work hard to get the information
that Reasonator has provided for over a year already. All kinds of
additional services, like the QR code and the family tree, can easily be
added. The Reasonator information can easily be shown in any language; just
add the labels.
Thanks,
      GerardM
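
A minimal illustration of the set comparison described in the quoted
messages below (two sources of similar quality, with curation effort spent
only where they disagree). This is a sketch only, not code from Wikidata,
Reasonator, Kian or any other tool named in this thread; the function name,
the item keys and the values are made up for the example.

```python
def split_by_agreement(source_a, source_b):
    """Partition item keys by how two sources relate to each other.

    Both arguments are plain dicts mapping an item key to a single value.
    Returns four sets: keys where both sources agree, keys where they
    differ, keys found only in source_a, keys found only in source_b.
    """
    shared = source_a.keys() & source_b.keys()
    agree = {key for key in shared if source_a[key] == source_b[key]}
    differ = shared - agree
    only_a = source_a.keys() - source_b.keys()
    only_b = source_b.keys() - source_a.keys()
    return agree, differ, only_a, only_b


# Hypothetical sample data: item keys and years are invented.
wikidata = {"item-1": 1850, "item-2": 1901, "item-3": 1777}
other_source = {"item-1": 1850, "item-2": 1910, "item-4": 1600}

agree, differ, only_wikidata, only_other = split_by_agreement(
    wikidata, other_source
)

# The "differ" set is the short list that needs human attention.
print(sorted(differ))
```

In this sketch only the keys in "differ" would need someone to look up a
source; keys present in just one of the two are candidates for import
rather than for correction.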

On 27 November 2015 at 20:14, Lila Tretikov <[email protected]> wrote:

> Hoi Gerard,
>
> What I hear in email from Andreas and Liam is not as much the propagation
> of the error (which I am sure happens with some % of the cases), but the
> fact that the original source is obscured and therefore it is hard to
> identify and correct errors, biases, etc. Because if the source of error is
> obscured, that error is that much harder to find and to correct. In fact,
> we see this even on Wikipedia articles today (wrong dates of birth sourced
> from publications that don't do enough fact-checking are something I came
> across personally). It is a powerful and important principle on Wikipedia,
> but with content re-use it gets lost. Public domain/CC0 in combination with
> AI lands our content for slicing and dicing and re-arranging by others,
> making it something entirely new, but also detached from our process of
> validation and verification. I am curious to hear if people think it is a
> problem. It definitely worries me.
>
> We have been looking very closely at Wikidata and the possibilities it
> offers. I am curious to understand more about your note on Reasonator:
>
> "As long as Wikidata does not have the power of a Reasonator, the data is
> just that. It does not make itself in information and consequently it is
> awful. When there is one thing the Wikidata engineers do not do, it is
> considering the use of the data and the workflows to improve the data and
> the quality."
>
> Am I right in understanding you to say that until the data sees the light
> of day it will not reach high quality?
>
> Thanks,
> Lila
>
> On Fri, Nov 27, 2015 at 10:26 AM, Gerard Meijssen <[email protected]> wrote:
>
> > Hoi,
> > When a benefit is "Wikimedia specific" and thereby dismissed, you miss
> > much of what is going on. Exactly because of this link most items are
> > well defined as to what they are about. It is not perfect but it is good.
> > Consequently Wikidata is able to link Wikipedia in any language to
> > sources external to Wikipedia. This is a big improvement over linking
> > external sources to a Wikipedia. The disambiguation of subjects is done
> > at the Wikidata end.
> >
> > You make Wikidata out to be a "default reference source". Given its
> > current state, that is a bit much. Wikidata does not have the maturity to
> > function as such. The best pointer to this fact is that 50% of all items
> > have two or fewer statements.
> >
> > When you compare the quality of Wikipedias with what en.wp used to be you
> > are comparing apples and oranges. The Myanmar Wikipedia is better
> > informed on Myanmar than en.wp etc.
> >
> > When you qualify a Wikipedia as fascist, it does not follow that the data
> > is suspect. Certainly when data in a source that you so easily dismiss is
> > typically the same, there is not much meaning in what you say from a
> > Wikidata point of view.
> >
> > I am thrilled that sources are so important to the Wikimedia movement
> > and, again, I am wondering what you hope to achieve by this
> > pronouncement. Be realistic: what is it that you want to achieve? Is
> > quality important to you, how do you define it and, more importantly, how
> > do you want to achieve it? Have you seen the statistics on sources [1]?
> > Then have a better look and you will find that real sources are mostly
> > absent. Adding sources one statement at a time will not significantly
> > improve quality because that is a numbers game, and it is easier to
> > achieve quality in a different way.
> >
> > When a librarian says that many sources copy each other's data and that
> > this is a problem, the bigger problem is missed. The bigger problem is
> > not where they agree but where they disagree. Arguably those are the
> > statements where quality is more likely an issue. Now ask your librarian
> > what is likely to improve Wikidata more: finding Sources for the
> > statements that differ, or finding Sources where the statements agree.
> > Wikidata is not authoritative, but when our community starts researching
> > such issues, both Wikidata and the other sources will rapidly improve in
> > quality. This is not to say that in the end you do not want Sources both
> > where sources agree and where they disagree.
> >
> > Then ask your librarian if there is a problem with missing data. We can
> > import data from sources and consequently be more informative, or we do
> > not import more data and people have to magically combine information
> > that exists in many sources to get a composite view. We could see
> > Wikidata as a place where data is combined and compared with other
> > sources. Do tell your librarian that the process mentioned above should
> > be iterative, and it will be easily understood that comparing with just
> > one additional source will improve the focus on likely issues even more.
> >
> > PS What does your librarian think when she knows that the Dutch National
> > Library is inclined to provide us with software so that books can be
> > ordered at Dutch libraries from Wikidata data (and by inference from
> > Wikipedias)?
> >
> > When some see Wikidata as a source of reference, they will increasingly
> > be served a better product. At this moment it is not good at all.
> >
> > When German Wikimedians have concerns about quality: WONDERFUL, but what
> > have they done to improve things? Do they apply Wikipedia standards and
> > how does that help?
> >
> > You wonder why have "bad" data in the first place... Our data IS bad and
> > there is not enough of it for it to be really useful. We can easily add
> > more data and have a more useful result. We can easily compare sources
> > and ask people to concentrate on differences. However, you cannot tell me
> > to add Sources to the data that I add. I will tell you to do it yourself.
> > I am happy to improve on quality but on my terms, not yours.
> >
> > You mention the propagation of errors. How would that work? You indicate
> > that there are not enough people to fix all the issues. With bots like
> > Kian, we have probability in adding data. We have people add data where
> > the software is not certain. You doubt technology but you do not know
> > where we are, what is already done.
> >
> > In short my feeling is that you do not know what you are talking about.
> > There is real scholarship in the approach that I described. My take is in
> > applying set theory. Kian is AI. For all I care yours is FUD.
> >
> > Your notion of accountability is one of a consumer; it is not the
> > accountability needed for a project that is immature and is not at all at
> > a stage where you should imply that it is good enough and that quality is
> > assured. There are domains in Wikidata that I will not touch because in
> > my opinion they are wrong in their principles. At the same time I know
> > that this can be fixed in time and leave it at that.
> >
> > I disagree with Heather Ford and Mark Graham. As long as Wikidata does
> > not have the power of a Reasonator, the data is just that. It does not
> > make itself in information and consequently it is awful. When there is
> > one thing the Wikidata engineers do not do, it is considering the use of
> > the data and the workflows to improve the data and the quality.
> >
> > The data needs to be CC-0 because it is how we ensure that everybody will
> > be happy and willing to participate. As more participation happens and
> > more collaboration occurs, we will see Wikidata increase in the amount of
> > data that it holds and at the same time we will see quality improve.
> >
> > Yes, Wikidata could do more in the way of adding sources to data. As long
> > as the "primary sources tool" does not add the sources it knows, what do
> > you expect from anybody else?
> > Thanks,
> >      GerardM
> >
> >
> > [1] https://tools.wmflabs.org/wikidata-todo/stats.php?reverse
> >
> >
> >
> > On 27 November 2015 at 12:08, Andreas Kolbe <[email protected]> wrote:
> >
> > > Gerard,
> > >
> > > On Tue, Nov 24, 2015 at 7:15 AM, Gerard Meijssen <[email protected]> wrote:
> > >
> > > > Hoi,
> > > > To start off, results from the past are no indication of results in
> > > > the future. It is the disclaimer insurance companies have to state in
> > > > all their adverts in the Netherlands. When you continue and make it a
> > > > "theological" issue, you lose me because I am not of this faith, far
> > > > from it. Wikidata is its own project and it is utterly dissimilar from
> > > > Wikipedia. To start off, Wikidata has been a certified success from
> > > > the start. The improvement it brought by bringing all interwiki links
> > > > together is enormous. That alone should be a pointer that Wikipedia
> > > > think is not realistic.
> > > >
> > >
> > >
> > > These benefits are internal to Wikimedia and a completely separate
> > > issue from third-party re-use of Wikidata content as a default
> > > reference source, which is the issue of concern here.
> > >
> > >
> > > > To continue, people have been importing data into Wikidata from the
> > > > start. They are the statements you know, and it was possible to
> > > > import them from Wikipedia because of these interwiki links. So when
> > > > you call for sources, it is fairly safe to assume that those imports
> > > > are supported by the quality of the statements of the Wikipedias
> > >
> > >
> > >
> > > The quality of three-quarters of the 280+ Wikipedia language versions
> > > is about at the level the English Wikipedia had reached in 2002.
> > >
> > > Even some of the larger Wikipedias have significant problems. The
> > > Kazakh Wikipedia for example is controlled by functionaries of an
> > > oppressive regime[1], and the Croatian one is reportedly[2] controlled
> > > by fascists rewriting history (unless things have improved markedly in
> > > the Croatian Wikipedia since that report, which would be news to me).
> > > The Azerbaijani Wikipedia seems to have problems as well.
> > >
> > > The Wikimedia movement has always had an important principle: that all
> > > content should be traceable to a "reliable source". Throughout the
> > > first decade of this movement and beyond, Wikimedia content has never
> > > been considered a reliable source. For example, you can't use a
> > > Wikipedia article as a reference in another Wikipedia article.
> > >
> > > Another important principle has been the disclaimer: pointing out to
> > > people that the data is anonymously crowdsourced, and that there is no
> > > guarantee of reliability or fitness for use.
> > >
> > > Both of these principles are now being jettisoned.
> > >
> > > Wikipedia content is considered a reliable source in Wikidata, and
> > > Wikidata content is used as a reliable source by Google, where it
> > > appears without any indication of its provenance. This is a reflection
> > > of the fact that Wikidata, unlike Wikipedia, comes with a CC0 licence.
> > > That decision was, I understand, made by Denny, who is both a Google
> > > employee and a WMF board member.
> > >
> > > The benefit to Google is very clear: this free, unattributed content
> > > adds value to Google's search engine result pages, and improves
> > > Google's revenue (currently running at about $10 million an hour, much
> > > of it from ads).
> > >
> > > But what is the benefit to the end user? The end user gets information
> > > of undisclosed provenance, which is presented to them as authoritative,
> > > even though it may be compromised. In what sense is that an improvement
> > > for society?
> > >
> > > To me, the ongoing information revolution is like the 19th century
> > > industrial revolution done over. It created whole new categories of
> > > abuse, which it took a century to (partly) eliminate. But first,
> > > capitalists had a field day, and the people who were screwed were the
> > > common folk. Could we not try to learn from history?
> > >
> > >
> > >
> > > > and if anything, that is also where they typically fail because many
> > > > assumptions at Wikipedia are plain wrong at Wikidata. For instance a
> > > > listed building is not the organisation the building is known for. At
> > > > Wikidata they each need their own item and associated statements.
> > > >
> > > > Wikidata is already a success for other reasons. VIAF no longer links
> > > > to Wikipedia but to Wikidata. The biggest benefit of this move is for
> > > > people who are not interested in English. Because of this change VIAF
> > > > links through Wikidata to all Wikipedias not only en.wp. Consequently
> > > > people may find through VIAF Wikipedia articles in their own language
> > > > through their library systems.
> > > >
> > >
> > >
> > > At the recent Wikiconference USA, a Wikimedia veteran and professional
> > > librarian expressed the view to me that
> > >
> > > * circular referencing between VIAF and Wikidata will create a
> > > humongous muddle that nobody will be able to sort out again afterwards,
> > > because – unlike wiki mishaps in other topic areas – here it's the most
> > > authoritative sources that are being corrupted by circular referencing;
> > >
> > > * third parties are using Wikimedia content as a *reference standard*
> > > when that was never the intention (see above).
> > >
> > > I've seen German Wikimedians express concerns that quality assurance
> > > standards have dropped alarmingly since the project began, with bot
> > > users mass-importing unreliable data.
> > >
> > >
> > >
> > > > So do not forget about Wikipedia and the lessons learned. These
> > > > lessons are important to Wikipedia. However, they do not necessarily
> > > > apply to Wikidata, particularly when you approach Wikidata as an
> > > > opportunity to do things in a different way. Set theory, a branch of
> > > > mathematics, is exactly what we need. When we have data at Wikidata
> > > > of a given quality, e.g. 90%, and we have data at another source with
> > > > a given quality, e.g. 90%, we can compare the two and find a subset
> > > > where the two sources do not match. When we curate the differences,
> > > > it is highly likely that we improve quality at Wikidata or at the
> > > > other source.
> > >
> > >
> > >
> > > This sounds like "Let's do it quick and dirty and worry about the
> > > problems later".
> > >
> > > I sometimes get the feeling software engineers just love a programming
> > > challenge, because that's where they can hone and display their skills.
> > > Dirty data is one of those challenges: all the clever things one can do
> > > to clean up the data! There is tremendous optimism about what can be
> > > done. But why have bad data in the first place, starting with rubbish
> > > and then proving that it can be cleaned up a bit using clever software?
> > >
> > > The effort will make the engineer look good, sure, but there will
> > > always be collateral damage as errors propagate before they are fixed.
> > > The engineer's eyes are not typically on the content, but on their
> > > software. The content their bots and programs manipulate at times seems
> > > almost incidental, something for "others" to worry about – "others" who
> > > don't necessarily exist in sufficient numbers to ensure quality.
> > >
> > > In short, my feeling is that the engineering enthusiasm and expertise
> > > applied to Wikidata aren't balanced by a similar level of commitment to
> > > scholarship in generating the data, and getting them right first time.
> > >
> > > We've seen where that approach can lead with Wikipedia. Wikipedia
> > > hoaxes and falsehoods find their way into the blogosphere, the media,
> > > even the academic literature. The stakes with Wikidata are potentially
> > > much higher, because I fear errors in Wikidata stand a good chance of
> > > being massively propagated by Google's present and future automated
> > > information delivery mechanisms, which are completely opaque. Most
> > > internet users aren't even aware to what extent the Google Knowledge
> > > Graph relies on anonymously compiled, crowdsourced data; they will just
> > > assume that if Google says it, it must be true.
> > >
> > > In addition to honest mistakes, transcription errors, outdated info
> > > etc., the whole thing is a propagandist's wet dream. Anonymous
> > > accounts! Guaranteed identity protection! Plausible deniability! No
> > > legal liability! Automated import and dissemination without human
> > > oversight! Massive impact on public opinion![3]
> > >
> > > If information is power, then this provides the best chance of a power
> > > grab humanity has seen since the invention of the newspaper. In the
> > > media landscape, you at least have right-wing, centrist and left-wing
> > > publications each presenting their version of the truth, and you know
> > > who's publishing what and what agenda they follow. You can pick and
> > > choose, compare and contrast, read between the lines. We won't have
> > > that online. Wikimedia-fuelled search engines like Google and Bing
> > > dominate the information supply.
> > >
> > > The right to enjoy a pluralist media landscape, populated by players
> > > who are accountable to the public, was hard won in centuries past. Some
> > > countries still don't enjoy that luxury today. Are we now blithely
> > > giving it away, in the name of progress, and for the greater glory of
> > > technocrats?
> > >
> > > I don't trust the way this is going. I see a distinct possibility that
> > > we'll end up with false information in Wikidata (or, rather, the Google
> > > Knowledge Graph) being used to "correct" accurate information in other
> > > sources, just because the Google/Wikidata content is ubiquitous. If you
> > > build circular referencing loops fuelled by spurious data, you don't
> > > provide access to knowledge, you destroy it. A lie told often enough
> > > etc.
> > >
> > > To quote Heather Ford and Mark Graham, "We know that the engineers and
> > > developers, volunteers and passionate technologists are often trying to
> > > do their best in difficult circumstances. But there need to be better
> > > attempts by people working on these platforms to explain how decisions
> > > are made about what is represented. These may just look like
> > > unimportant lines of code in some system somewhere, but they have a
> > > very real impact on the identities and futures of people who are often
> > > far removed from the conversations happening among engineers."
> > >
> > > I agree with that. The "what" should be more important than the "how",
> > > and at present it doesn't seem to be.
> > >
> > > It's well worth thinking about, and having a debate about what can be
> > > done to prevent the worst from happening.
> > >
> > > In particular, I would like to see the decision to publish Wikidata
> > > under a CC0 licence revisited. The public should know where the data it
> > > gets comes from; that's a basic issue of transparency.
> > >
> > > Andreas
> > >
> > > [1] https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-10-07/Op-ed
> > > [2] http://www.dailydot.com/politics/croatian-wikipedia-fascist-takeover-controversy-right-wing/
> > > [3] http://www.politico.com/magazine/story/2015/08/how-google-could-rig-the-2016-election-121548
> > > _______________________________________________
> > > Wikimedia-l mailing list, guidelines at:
> > > https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
> > > [email protected]
> > > Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> > > <mailto:[email protected]?subject=unsubscribe>
> > >
