Re: [Wikidata] On traceability and reliability of data we publish [was Re: [Wikimedia-l] Solve legal uncertainty of Wikidata]

2018-07-08 Thread Egon Willighagen
On Sat, Jul 7, 2018 at 5:59 PM mathieu lovato stumpf guntz <
psychosl...@culture-libre.org> wrote:

> I agree this is a misconception that a copyright license makes any direct
> change to data reliability. But an attribution requirement does somewhat
> indirectly have an impact on it, as it legally enforces traceability.
>
I know that "law" holds a special place, but that does not always make it
the best tool... law, in the end, is just a social construct, like anything
else we agree on. First, we all agree (it seems to me) that provenance is
valuable.

However, putting something in law (or contract) effectively criminalizes
failing to add the provenance. Is that what you really wish? Do you want
to be able to legally punish people if they fail to give provenance?
Honestly, that sounds a bit harsh to me... and, as a personal opinion
rather than an argument, I think Wikidata is more open, more
inclusive than that: Wikidata offers carrots, not sticks.

Egon

-- 
E.L. Willighagen
Department of Bioinformatics - BiGCaT
Maastricht University (http://www.bigcat.unimaas.nl/)
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: https://www.zotero.org/egonw
ORCID: -0001-7542-0286 
ImpactStory: https://impactstory.org/u/egonwillighagen
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] On traceability and reliability of data we publish [was Re: [Wikimedia-l] Solve legal uncertainty of Wikidata]

2018-07-07 Thread Gerard Meijssen
Hoi,
Balderdash and Wikipedia-think. When you think Wikipedia has quality, and
it has, it does not have absolute quality. I have added a lot of
information from Wikipedia to Wikidata, and from a data perspective a lot
of it is plain wrong: there are errors, and a lot is just missing. This is
particularly true when the subject is not really what people are
interested in. Things like the Polk award, subdistricts of Botswana; the
list is long. I am adding much of the information by hand, adding the
missing parts, and the main use for the missing data is in the relations.

As I have said so often, quality of data is in having the same data in
multiple sources. It follows that the data that can safely be added to
Wikidata is the data where multiple sources agree on the represented facts.
This is most easily done by bots, whose algorithms are defined in
their code. When new data is included based on a multitude of sources, what
is the source? Particularly when data is inconsistent, because multiple
sources cannot agree on specific data, sources become relevant, but that is
also where you go into real research.
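
To illustrate the agreement rule described above, here is a minimal sketch
in Python. It is not taken from any actual bot: the function name
agreed_value, the min_sources threshold and the source labels are all
assumptions made for the example.

```python
def agreed_value(claims, min_sources=2):
    """Return (value, sources) when every claim reports the same value and
    at least `min_sources` distinct sources back it; otherwise return None.

    `claims` is an iterable of (source, value) pairs. This is a hypothetical
    helper for illustration, not an existing Wikidata bot API.
    """
    sources_by_value = {}
    for source, value in claims:
        sources_by_value.setdefault(value, set()).add(source)

    if len(sources_by_value) != 1:
        # No claims at all, or the sources disagree: a case for real research.
        return None

    value, sources = next(iter(sources_by_value.items()))
    if len(sources) < min_sources:
        # A single source repeated several times does not count as agreement.
        return None
    return value, sorted(sources)


claims = [("source_a", "1949"), ("source_b", "1949")]
print(agreed_value(claims))
```

When the sources disagree the function returns None, which is the case
where sourcing each fact individually starts to matter.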

Arguably, when data sources differ, you easily get into disputed facts and
fake facts. This is where sourcing the facts becomes relevant. It is also
where you get into real research and where as a consequence the license of
the information becomes irrelevant.

In my opinion, we have grown up thinking in terms of serial sourcing, and
particularly when you apply this approach to data stores like Wikidata,
your algorithms and thinking fail reality.
Thanks,
  GerardM

On 7 July 2018 at 19:55, Stas Malyshev wrote:

> Hi!
>
> > I agree this is a misconception that a copyright license makes any direct
> > change to data reliability. But an attribution requirement does somewhat
> > indirectly have an impact on it, as it legally enforces traceability.
>
> While true, I don't think it's of much practical use if traceability is
> what you are seriously interested in. Imagine Wikidata were CC-BY, so
> each piece of data you use from Wikidata now has to be marked as "coming
> from Wikidata.Org". What have you gained? Wikidata is huge, and this
> mark doesn't even tell you which item it is from, while being completely
> satisfactory legally. It is even more useless for actually ensuring the
> data is correct or tracing its provenance to primary sources - you'd
> still have to find the item and check the references manually (or
> automatically, maybe), as you could do for CC0. A CC-BY license would not
> have added very much on the Wikidata side.
> All this is while, of course, even with CC0 nothing prevents you from
> importing Wikidata data in such a way that each piece of data still
> carries the mark "coming from Wikidata". While it is not a legal
> requirement with CC0, nothing in CC0 prevents that from happening. If
> your provenance needs are met by this, there's nothing preventing
> you from doing this, and the legal requirements of CC-BY do not improve it
> for you in any way - they would just force people that *do not* need to
> do it to still do it.
>
> > That is, I strongly disagree with the following assertion: "a license
> > that requires BY sucks so hard for data [because] attribution
> > requirements grow very quickly". To my mind it is equivalent to saying that
>
> I think this assertion (that attribution requirements grow) is factually
> true. Each data piece from a CC-BY data set needs to carry attribution. If
> your data needs require combining several data sets, each of them needs
> to carry attribution. This attribution should be carried through all
> data processing pipelines. You may be OK with this growth, but as I just
> explained above, these requirements, while being onerous for people that
> don't need to trace each piece of data, are still unsatisfactory in many
> cases for those that do. So having CC-BY would be both onerous and useless.
>
> > we will throw away traceability because it is subjectively judged too
> > large a burden, without providing even the beginning of evidence that it
> > indeed can't be managed, at least with Wikimedia's current resources.
>
> It's not Wikimedia that will be shouldering the burden, it's every user
> of Wikimedia data sets.
>
> --
> Stas Malyshev
> smalys...@wikimedia.org


Re: [Wikidata] On traceability and reliability of data we publish [was Re: [Wikimedia-l] Solve legal uncertainty of Wikidata]

2018-07-07 Thread mathieu lovato stumpf guntz

On 07/07/2018 at 19:55, Stas Malyshev wrote:

> I think this assertion (that attribution requirements grow) is factually
> true. Each data piece from a CC-BY data set needs to carry attribution. If
> your data needs require combining several data sets, each of them needs
> to carry attribution. This attribution should be carried through all
> data processing pipelines. You may be OK with this growth, but as I just
> explained above, these requirements, while being onerous for people that
> don't need to trace each piece of data, are still unsatisfactory in many
> cases for those that do. So having CC-BY would be both onerous and useless.

Hi Stas,

The attribution needs to be carried only through processing pipelines
whose results need to be published.


Can we talk about real, concrete examples where attribution would
seriously prevent any real use case? If all this stands on solid facts,
surely it shouldn't be too hard to come up with at least one example.
Otherwise, it is certainly useless to continue this discussion.


Cheers



Re: [Wikidata] On traceability and reliability of data we publish [was Re: [Wikimedia-l] Solve legal uncertainty of Wikidata]

2018-07-07 Thread Stas Malyshev
Hi!

> I agree this is a misconception that a copyright license makes any direct
> change to data reliability. But an attribution requirement does somewhat
> indirectly have an impact on it, as it legally enforces traceability.

While true, I don't think it's of much practical use if traceability is
what you are seriously interested in. Imagine Wikidata were CC-BY, so
each piece of data you use from Wikidata now has to be marked as "coming
from Wikidata.Org". What have you gained? Wikidata is huge, and this
mark doesn't even tell you which item it is from, while being completely
satisfactory legally. It is even more useless for actually ensuring the
data is correct or tracing its provenance to primary sources - you'd
still have to find the item and check the references manually (or
automatically, maybe), as you could do for CC0. A CC-BY license would not
have added very much on the Wikidata side.
All this is while, of course, even with CC0 nothing prevents you from
importing Wikidata data in such a way that each piece of data still
carries the mark "coming from Wikidata". While it is not a legal
requirement with CC0, nothing in CC0 prevents that from happening. If
your provenance needs are met by this, there's nothing preventing
you from doing this, and the legal requirements of CC-BY do not improve it
for you in any way - they would just force people that *do not* need to
do it to still do it.

> That is, I strongly disagree with the following assertion: "a license
> that requires BY sucks so hard for data [because] attribution
> requirements grow very quickly". To my mind it is equivalent to saying that

I think this assertion (that attribution requirements grow) is factually
true. Each data piece from a CC-BY data set needs to carry attribution. If
your data needs require combining several data sets, each of them needs
to carry attribution. This attribution should be carried through all
data processing pipelines. You may be OK with this growth, but as I just
explained above, these requirements, while being onerous for people that
don't need to trace each piece of data, are still unsatisfactory in many
cases for those that do. So having CC-BY would be both onerous and useless.

> we will throw away traceability because it is subjectively judged too
> large a burden, without providing even the beginning of evidence that it
> indeed can't be managed, at least with Wikimedia's current resources.

It's not Wikimedia that will be shouldering the burden, it's every user
of Wikimedia data sets.
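
To make the growth argument above concrete, here is a minimal sketch in
Python of attribution being carried through a processing pipeline. It
assumes that every derived value must credit all of its inputs. The names
Attributed, from_source and combine, and the data set labels, are
hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attributed:
    """A value together with every source that has to be credited for it."""
    value: object
    credits: frozenset


def from_source(value, source):
    """Wrap a raw value taken from a single (hypothetical) data set."""
    return Attributed(value, frozenset([source]))


def combine(func, *inputs):
    """Derive a new value; its credits are the union of all inputs' credits."""
    merged = frozenset().union(*(item.credits for item in inputs))
    return Attributed(func(*(item.value for item in inputs)), merged)


population = from_source(1000, "Data set A")
area = from_source(50, "Data set B")
density = combine(lambda p, a: p / a, population, area)

threshold = from_source(10, "Data set C")
is_dense = combine(lambda d, t: d > t, density, threshold)

# Each processing step widens the list of sources to credit.
print(sorted(density.credits))
print(sorted(is_dense.credits))
```

Every consumer of is_dense inherits all three credits, whether or not they
need to trace anything, and none of the credits says which item or
reference a value actually came from.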

-- 
Stas Malyshev
smalys...@wikimedia.org



[Wikidata] On traceability and reliability of data we publish [was Re: [Wikimedia-l] Solve legal uncertainty of Wikidata]

2018-07-07 Thread mathieu lovato stumpf guntz

Hi Andra,

I agree this is a misconception that a copyright license makes any direct
change to data reliability. But an attribution requirement does somewhat
indirectly have an impact on it, as it legally enforces traceability.
That is, I strongly disagree with the following assertion: "a license
that requires BY sucks so hard for data [because] attribution
requirements grow very quickly". To my mind it is equivalent to saying that
we will throw away traceability because it is subjectively judged too
large a burden, without providing even the beginning of evidence that it
indeed can't be managed, at least with Wikimedia's current resources.


Now, I don't say traceability is the sole factor one should take into
account in data reliability, but it is certainly one of them. Maybe we
should first come up with clear criteria to put into an equation that
makes it possible to calculate the reliability of information. Since it's
among the core goals of the Wikimedia strategy, it would certainly be worth
the effort to establish clear metrics about the reliability of the
information the movement is spreading.


Cheers


On 04/07/2018 at 13:00, Andra Waagmeester wrote:
> I agree with Maarten, and to add to that: it is a huge misconception
> that CC0 makes data unreliable. It is only a legal statement about
> copyright, nothing more, nothing less. Statements without proper
> references and qualifiers make data unreliable, but Wikidata has a
> decent mechanism to capture that needed provenance.
>
> On Wed, Jul 4, 2018 at 12:50 PM, Maarten Dammers wrote:
>
> > Hi Mathieu,
> >
> > On 04-07-18 11:07, mathieu stumpf guntz wrote:
> >
> > > Hi,
> > >
> > > On 19/05/2018 at 03:35, Denny Vrandečić wrote:
> > >
> > > > Regarding attribution, commonly it is assumed that you
> > > > have to respect it transitively. That is one of the
> > > > reasons a license that requires BY sucks so hard for data:
> > > > unlike with text, the attribution requirements grow very
> > > > quickly. It is the same as with modified images and
> > > > collages: it is not sufficient to attribute the last
> > > > author, but all contributors have to be attributed.
> > >
> > > If we want our data to be trustable, then we need
> > > traceability. That is, reporting this chain of sources as
> > > extensively as possible, whether or not the license requires
> > > it as attribution. CC-0 allows this traceability to be broken,
> > > which makes it an awful license for whoever is concerned with
> > > obtaining reliable data.
> >
> > A license is not the way to achieve this. We have references for that.
> >
> > > > This is why I think that whoever wants to be part of a
> > > > large federation of data on the web, should publish under CC0.
> > >
> > > As long as one aims at making a federation of untrustable data
> > > banks, that's perfect. ;)
> >
> > So I see you started forum shopping (trying to get the Wikimedia-l
> > people in) and making contentious, trying-to-be-funny remarks.
> > That's usually a good indication a thread is going nowhere.
> >
> > No, Wikidata is not going to change the CC0. You seem to be the
> > only person wanting that, and trying to discredit Wikidata will not
> > help you in your crusade. I suggest the people who are still
> > interested in this go to
> > https://phabricator.wikimedia.org/T193728
> > and make useful comments over there.
> >
> > Maarten

