Re: [Wikidata] [Spam] Re: No links, wrong data: Scotland's orphans need help

2015-06-03 Thread Dario Taraborelli
I second this. For a related effort, see:

https://github.com/pav-ontology/pav/

In particular, see pav:sourceLastAccessedOn, pav:lastRefreshedOn, and pav:lastUpdateOn:
http://pav-ontology.github.io/pav/#d4e846

> On Jun 3, 2015, at 3:56 PM, Markus Krötzsch  
> wrote:
> 
> On 03.06.2015 13:57, Magnus Manske wrote:
>> Maybe there is a case to separate import and verification here?
>> 
>> There are many statements in Wikidata nowadays, but they get really
>> "trustworthy" through references (other than "imported from Wikipedia").
>> But for external IDs, references are superfluous; they are their own
>> reference, by definition. So how about marking IDs with a "verified" (or
>> "last verified on") qualifier? Much of such work could be done by bots;
>> we could then filter the problematic ones out for manual verification.
>> 
>> As we have no control over external lists, this would have to be
>> re-checked every so often; but again, bots to the rescue.
>> 
> 
> Yes, I fully support this proposal.
> 
> What do you think about making "last verified on" not a qualifier but (part 
> of) the reference information? The reference could state where the bot has 
> looked up the ID and give a time. This would be somewhat similar to what is 
> now used in Freebase Ids, e.g., in https://www.wikidata.org/wiki/Q42.
> 
> In general, it might be useful to have such a "last verified on" property 
> that can be added to arbitrary references. There are many other uses for 
> this. One common case would be that a user has changed the value without even 
> being aware of the reference -- then one would be able to detect this 
> automatically by comparing the last modification time with the "last verified 
> on" date.
> 
> Putting the "last verified on" into the references also makes it possible to 
> have different dates for different references there.
> 
> Regards,
> 
> Markus
> 
> 
> 
> 
> 
> 
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata




Re: [Wikidata] [Spam] Re: No links, wrong data: Scotland's orphans need help

2015-06-03 Thread Markus Krötzsch

On 03.06.2015 13:57, Magnus Manske wrote:

Maybe there is a case to separate import and verification here?

There are many statements in Wikidata nowadays, but they get really
"trustworthy" through references (other than "imported from Wikipedia").
But for external IDs, references are superfluous; they are their own
reference, by definition. So how about marking IDs with a "verified" (or
"last verified on") qualifier? Much of such work could be done by bots;
we could then filter the problematic ones out for manual verification.

As we have no control over external lists, this would have to be
re-checked every so often; but again, bots to the rescue.



Yes, I fully support this proposal.

What do you think about making "last verified on" not a qualifier but 
(part of) the reference information? The reference could state where the 
bot has looked up the ID and give a time. This would be somewhat similar 
to what is now used in Freebase Ids, e.g., in 
https://www.wikidata.org/wiki/Q42.


In general, it might be useful to have such a "last verified on" 
property that can be added to arbitrary references. There are many other 
uses for this. One common case would be that a user has changed the 
value without even being aware of the reference -- then one would be 
able to detect this automatically by comparing the last modification 
time with the "last verified on" date.


Putting the "last verified on" into the references also makes it 
possible to have different dates for different references there.
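
The comparison described above could be sketched roughly as follows. This is only an illustration: the Reference and Statement classes, their field names, and the sample dates are assumptions made for the example, not actual Wikidata structures or an existing property.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Reference:
    source: str                 # where the bot looked the value up (hypothetical field)
    last_verified_on: datetime  # hypothetical "last verified on" date


@dataclass
class Statement:
    value: str
    last_modified: datetime     # time of the last edit to the value
    references: List[Reference] = field(default_factory=list)


def stale_references(statement: Statement) -> List[Reference]:
    """Return the references that were verified before the value last changed."""
    return [
        ref for ref in statement.references
        if ref.last_verified_on < statement.last_modified
    ]


# Made-up sample: the value was edited after "source B" was last checked.
stmt = Statement(
    value="some-id",
    last_modified=datetime(2015, 6, 2),
    references=[
        Reference("source A", datetime(2015, 6, 3)),
        Reference("source B", datetime(2015, 5, 1)),
    ],
)
print([ref.source for ref in stale_references(stmt)])
```

Because each reference carries its own date, one statement can have some references that are still current and others that need re-checking.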


Regards,

Markus








Re: [Wikidata] [Spam] Re: No links, wrong data: Scotland's orphans need help

2015-06-03 Thread Markus Krötzsch

Thanks, Andrew, for the clarification. This makes perfect sense.

I don't see a problem with one bridge having two IDs in some external 
database. We already have this for other ID-like properties for other 
reasons. What is important though is that it still is a single bridge, 
and should therefore be one item.


Your clarification is reassuring since it suggests that the problem is 
not overly common after all. Maybe one can just merge these cases 
manually. Once the (multiple) ids are found in the merged items, 
avoiding future duplicates will be done as usual (which is still 
difficult with the Scottish Heritage ids since we have many legit 
Wikidata items that have the same id -- but this at least is an 
independent problem).


Regards,

Markus

On 03.06.2015 13:48, Andrew Gray wrote:

This particular case is something of a known problem - we've
encountered it with some of the other heritage-building identifier
lists as well.

Bridges often span a river which is the border for two jurisdictions
(in this case, council areas). Each local area counts it as a historic
building, and because the national lists are aggregated from local
lists, it gets two entries in the main list, one as Fife and one as
Edinburgh. A similar case in Wales is the Menai Suspension Bridge,
which is 4049 from the Gwynedd register and 18572 from the Anglesey
one (Wikidata, at Q581526, only lists one identifier).

The lack of deduplication is probably intentional rather than a bug,
and both entries are "correct". Perhaps one way to handle this for
Wikidata would be to, hmm, say something like "if the item is some
kind of a bridge, then allow two IDs" in the constraints?

I can't immediately think of any bridges which cross national borders
*and* are a heritage building in both countries, but we'd see the same
thing there, with it having identifiers from both sides.

Andrew.

On 2 June 2015 at 12:12, Markus Krötzsch  wrote:

Another interesting type of Scottish historic orphans are those that are
duplicates of items that do have site links. Even very prominent ones are
duplicated, such as

https://www.wikidata.org/wiki/Q17569486 (dup)
https://www.wikidata.org/wiki/Q933000 (real item)

Interestingly, they use different Scotland IDs, and it does indeed seem that
Historic Scotland also contains duplicates:

http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0BUILDING,HL:47778
http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0BUILDING,HL:49165

Overall, this seems to be an example of an ID that really should not be
considered "identity providing" since there seems to be a many-to-many
relationship between Wikidata and Historic Scotland. Orphans should receive
additional ids from a better source if at all possible. With the great
number of seemingly legit non-functional uses of the Scotland IDs, they
cannot be used in practice to detect duplicates.
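
A check for this kind of many-to-many relationship might look like the sketch below. The function name and the sample (item, ID) pairs are made up for illustration; they are not taken from Wikidata or from the Historic Scotland data.

```python
from collections import defaultdict


def non_functional_uses(pairs):
    """Split (item, external_id) pairs into the two non-functional cases.

    Returns (shared_ids, multi_id_items):
      shared_ids     maps an ID to the items that share it,
      multi_id_items maps an item to the IDs attached to it.
    """
    items_by_id = defaultdict(set)
    ids_by_item = defaultdict(set)
    for item, ext_id in pairs:
        items_by_id[ext_id].add(item)
        ids_by_item[item].add(ext_id)
    shared_ids = {i: sorted(s) for i, s in items_by_id.items() if len(s) > 1}
    multi_id_items = {q: sorted(s) for q, s in ids_by_item.items() if len(s) > 1}
    return shared_ids, multi_id_items


# Hypothetical sample data: one ID used by two items, one item with two IDs.
pairs = [
    ("item-church", "id-1"),
    ("item-churchyard", "id-1"),
    ("item-bridge", "id-2"),
    ("item-bridge", "id-3"),
]
shared_ids, multi_id_items = non_functional_uses(pairs)
print(shared_ids)
print(multi_id_items)
```

If both results are non-empty and many of the entries are legitimate, the ID is not suitable as the sole key for duplicate detection.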

Regards,

Markus



On 02.06.2015 13:01, Markus Krötzsch wrote:


On 02.06.2015 11:30, Magnus Manske wrote:


Update 2:
For example,
https://www.wikidata.org/wiki/Q17847522
and
https://www.wikidata.org/wiki/Q17847537
have the same Scotland ID, but refer to different entities (church and
churchyard, respectively). They were listed as two entities in the original
dataset, sharing the same ID.



Yes, I noticed such cases too. From the information in Wikidata, it is not
clear to me why this is sometimes done and sometimes not done.

For example, these adjacent houses have the same Scotland ID but
different items that each have their own coordinates (where did the
coordinates come from?):

https://www.wikidata.org/wiki/Q17576211
https://www.wikidata.org/wiki/Q17576182
https://www.wikidata.org/wiki/Q17576185

In many other cases, adjacent houses with the same ID are combined into
one item:

https://www.wikidata.org/wiki/Q17806587

(note, however, that the house addresses given in the ID and in the item
label do not match, though they overlap on most of the houses.)

Finally, there are also cases where there are different IDs and we have
several items, but they have the same labels that merge the contents of
the two IDs:

https://www.wikidata.org/wiki/Q17810121
https://www.wikidata.org/wiki/Q17810137


It seems that the data was not taken from the Historic Sites database
but from some different source that has its own coordinate data and a
different (but seemingly arbitrary) approach to grouping sites. However,
the coordinates give Historic Scotland as their reference -- I wonder if
Historic Scotland might be changing frequently or exist in several
versions.

Regards,

Markus




On Tue, Jun 2, 2015 at 10:26 AM Magnus Manske
mailto:magnusman...@googlemail.com>> wrote:

 Update: There appear to be quite a few items with duplicate Scotland
 IDs (not all of them may be erroneous!):
 http://wdq.wmflabs.org/stats?action=doublestring&prop=709

 On Tue, Jun 2, 2015 at 10:23 AM Magnus Manske
 mailto:magnusman...@googlemail.com>>
 wrote:

 I created (some/

Re: [Wikidata] [Spam] Re: No links, wrong data: Scotland's orphans need help

2015-06-03 Thread Andy Mabbett
On 3 June 2015 at 12:48, Andrew Gray  wrote:

> The lack of deduplication is probably intentional rather than a bug,
> and both entries are "correct". Perhaps one way to handle this for
> Wikidata would be to, hmm, say something like "if the item is some
> kind of a bridge, then allow two IDs" in the constraints?

The constraint should be "usually one ID" (i.e. "SHOULD only have one
ID"), not "MUST have only one ID".

Wikidata already allows for this, and the constraints are editable.

See also the talk page and report for P496 for an example of a listed exception.
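
As a rough sketch of a "SHOULD" constraint with listed exceptions: the function below reports items with more than one ID as warnings rather than errors, and skips items on an exception list. The function name and "Q_example" are hypothetical; the two Menai Suspension Bridge IDs are the ones Andrew quoted. This is not how the actual Wikidata constraint reports are implemented.

```python
def check_single_value(ids_by_item, exceptions=frozenset()):
    """Return (item, ids) warnings for items with several IDs.

    Items listed in `exceptions` are known legitimate cases and are skipped.
    The result is a list of warnings for review, not a hard failure.
    """
    warnings = []
    for item, ids in ids_by_item.items():
        if len(ids) > 1 and item not in exceptions:
            warnings.append((item, sorted(ids)))
    return warnings


ids_by_item = {
    "Q581526": {"4049", "18572"},   # Menai Suspension Bridge, two registers
    "Q_example": {"100", "200"},    # hypothetical item, not a known exception
    "Q_single": {"300"},            # hypothetical item with a single ID
}
print(check_single_value(ids_by_item, exceptions={"Q581526"}))
```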

-- 
Andy Mabbett
@pigsonthewing
http://pigsonthewing.org.uk



Re: [Wikidata] [Spam] Re: No links, wrong data: Scotland's orphans need help

2015-06-03 Thread Magnus Manske
Maybe there is a case to separate import and verification here?

There are many statements in Wikidata nowadays, but they get really
"trustworthy" through references (other than "imported from Wikipedia").
But for external IDs, references are superfluous; they are their own
reference, by definition. So how about marking IDs with a "verified" (or
"last verified on") qualifier? Much of such work could be done by bots; we
could then filter the problematic ones out for manual verification.

As we have no control over external lists, this would have to be re-checked
every so often; but again, bots to the rescue.
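
A bot pass along these lines could be sketched as below. Everything here is an assumption made for illustration: verify_ids is a hypothetical helper, and `lookup` stands in for whatever check a real bot would run against the external list.

```python
from datetime import date


def verify_ids(statements, lookup, today):
    """Partition (item, external_id) statements by whether the ID still resolves.

    `lookup` is a caller-supplied function returning True when the ID is
    found in the external list. Verified statements get a "last verified on"
    date; the rest are set aside for manual verification.
    """
    verified, problematic = [], []
    for item, ext_id in statements:
        if lookup(ext_id):
            verified.append((item, ext_id, today))
        else:
            problematic.append((item, ext_id))
    return verified, problematic


# Stand-in for the external list: a plain set of known IDs (made-up values).
known_ids = {"id-1", "id-2"}
statements = [("item-a", "id-1"), ("item-b", "id-9")]
verified, problematic = verify_ids(
    statements, lambda ext_id: ext_id in known_ids, date(2015, 6, 3)
)
print(verified)
print(problematic)
```

Re-running the same pass periodically would refresh the dates and surface IDs that have disappeared from the external list.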

On Wed, Jun 3, 2015 at 12:49 PM Andrew Gray 
wrote:

> This particular case is something of a known problem - we've
> encountered it with some of the other heritage-building identifier
> lists as well.
>
> Bridges often span a river which is the border for two jurisdictions
> (in this case, council areas). Each local area counts it as a historic
> building, and because the national lists are aggregated from local
> lists, it gets two entries in the main list, one as Fife and one as
> Edinburgh. A similar case in Wales is the Menai Suspension Bridge,
> which is 4049 from the Gwynedd register and 18572 from the Anglesey
> one (Wikidata, at Q581526, only lists one identifier).
>
> The lack of deduplication is probably intentional rather than a bug,
> and both entries are "correct". Perhaps one way to handle this for
> Wikidata would be to, hmm, say something like "if the item is some
> kind of a bridge, then allow two IDs" in the constraints?
>
> I can't immediately think of any bridges which cross national borders
> *and* are a heritage building in both countries, but we'd see the same
> thing there, with it having identifiers from both sides.
>
> Andrew.
>
> On 2 June 2015 at 12:12, Markus Krötzsch 
> wrote:
> > Another interesting type of Scottish historic orphans are those that are
> > duplicates of items that do have site links. Even very prominent ones are
> > duplicated, such as
> >
> > https://www.wikidata.org/wiki/Q17569486 (dup)
> > https://www.wikidata.org/wiki/Q933000 (real item)
> >
> > Interestingly, they use different Scotland IDs, and it does indeed seem that
> > Historic Scotland also contains duplicates:
> >
> > http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0BUILDING,HL:47778
> > http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0BUILDING,HL:49165
> >
> > Overall, this seems to be an example of an ID that really should not be
> > considered "identity providing" since there seems to be a many-to-many
> > relationship between Wikidata and Historic Scotland. Orphans should receive
> > additional ids from a better source if at all possible. With the great
> > number of seemingly legit non-functional uses of the Scotland IDs, they
> > cannot be used in practice to detect duplicates.
> >
> > Regards,
> >
> > Markus
> >
> >
> >
> > On 02.06.2015 13:01, Markus Krötzsch wrote:
> >>
> >> On 02.06.2015 11:30, Magnus Manske wrote:
> >>>
> >>> Update 2:
> >>> For example,
> >>> https://www.wikidata.org/wiki/Q17847522
> >>> and
> >>> https://www.wikidata.org/wiki/Q17847537
> >>> have the same Scotland ID, but refer to different entities (church and
> >>> churchyard, respectively). They were listed as two entities in the original
> >>> dataset, sharing the same ID.
> >>
> >>
> >> Yes, I noticed such cases too. From the information in Wikidata, it is not
> >> clear to me why this is sometimes done and sometimes not done.
> >>
> >> For example, these adjacent houses have the same Scotland ID but
> >> different items that each have their own coordinates (where did the
> >> coordinates come from?):
> >>
> >> https://www.wikidata.org/wiki/Q17576211
> >> https://www.wikidata.org/wiki/Q17576182
> >> https://www.wikidata.org/wiki/Q17576185
> >>
> >> In many other cases, adjacent houses with the same ID are combined into
> >> one item:
> >>
> >> https://www.wikidata.org/wiki/Q17806587
> >>
> >> (note, however, that the house addresses given in the ID and in the item
> >> label do not match, though they overlap on most of the houses.)
> >>
> >> Finally, there are also cases where there are different IDs and we have
> >> several items, but they have the same labels that merge the contents of
> >> the two IDs:
> >>
> >> https://www.wikidata.org/wiki/Q17810121
> >> https://www.wikidata.org/wiki/Q17810137
> >>
> >>
> >> It seems that the data was not taken from the Historic Sites database
> >> but from some different source that has its own coordinate data and a
> >> different (but seemingly arbitrary) approach to grouping sites. However,
> >> the coordinates give Historic Scotland as their reference -- I wonder if
> >> Historic Scotland might be changing frequently or exist in several
> >> versions.
> >>
> >> Regards,
> >>
> >> Markus
> >>
> >>
> >>>
> >>> On Tue, Jun 2, 2015 at 10:26 AM Magnus Manske
> >>> mailto:magnusman...@googlemail.com>> wrote:
> >>>
> >>> Update: There app

Re: [Wikidata] [Spam] Re: No links, wrong data: Scotland's orphans need help

2015-06-03 Thread Andrew Gray
This particular case is something of a known problem - we've
encountered it with some of the other heritage-building identifier
lists as well.

Bridges often span a river which is the border for two jurisdictions
(in this case, council areas). Each local area counts it as a historic
building, and because the national lists are aggregated from local
lists, it gets two entries in the main list, one as Fife and one as
Edinburgh. A similar case in Wales is the Menai Suspension Bridge,
which is 4049 from the Gwynedd register and 18572 from the Anglesey
one (Wikidata, at Q581526, only lists one identifier).

The lack of deduplication is probably intentional rather than a bug,
and both entries are "correct". Perhaps one way to handle this for
Wikidata would be to, hmm, say something like "if the item is some
kind of a bridge, then allow two IDs" in the constraints?

I can't immediately think of any bridges which cross national borders
*and* are a heritage building in both countries, but we'd see the same
thing there, with it having identifiers from both sides.

Andrew.

On 2 June 2015 at 12:12, Markus Krötzsch  wrote:
> Another interesting type of Scottish historic orphans are those that are
> duplicates of items that do have site links. Even very prominent ones are
> duplicated, such as
>
> https://www.wikidata.org/wiki/Q17569486 (dup)
> https://www.wikidata.org/wiki/Q933000 (real item)
>
> Interestingly, they use different Scotland IDs, and it does indeed seem that
> Historic Scotland also contains duplicates:
>
> http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0BUILDING,HL:47778
> http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0BUILDING,HL:49165
>
> Overall, this seems to be an example of an ID that really should not be
> considered "identity providing" since there seems to be a many-to-many
> relationship between Wikidata and Historic Scotland. Orphans should receive
> additional ids from a better source if at all possible. With the great
> number of seemingly legit non-functional uses of the Scotland IDs, they
> cannot be used in practice to detect duplicates.
>
> Regards,
>
> Markus
>
>
>
> On 02.06.2015 13:01, Markus Krötzsch wrote:
>>
>> On 02.06.2015 11:30, Magnus Manske wrote:
>>>
>>> Update 2:
>>> For example,
>>> https://www.wikidata.org/wiki/Q17847522
>>> and
>>> https://www.wikidata.org/wiki/Q17847537
>>> have the same Scotland ID, but refer to different entities (church and
>>> churchyard, respectively). They were listed as two entities in the original
>>> dataset, sharing the same ID.
>>
>>
>> Yes, I noticed such cases too. From the information in Wikidata, it is not
>> clear to me why this is sometimes done and sometimes not done.
>>
>> For example, these adjacent houses have the same Scotland ID but
>> different items that each have their own coordinates (where did the
>> coordinates come from?):
>>
>> https://www.wikidata.org/wiki/Q17576211
>> https://www.wikidata.org/wiki/Q17576182
>> https://www.wikidata.org/wiki/Q17576185
>>
>> In many other cases, adjacent houses with the same ID are combined into
>> one item:
>>
>> https://www.wikidata.org/wiki/Q17806587
>>
>> (note, however, that the house addresses given in the ID and in the item
>> label do not match, though they overlap on most of the houses.)
>>
>> Finally, there are also cases where there are different IDs and we have
>> several items, but they have the same labels that merge the contents of
>> the two IDs:
>>
>> https://www.wikidata.org/wiki/Q17810121
>> https://www.wikidata.org/wiki/Q17810137
>>
>>
>> It seems that the data was not taken from the Historic Sites database
>> but from some different source that has its own coordinate data and a
>> different (but seemingly arbitrary) approach to grouping sites. However,
>> the coordinates give Historic Scotland as their reference -- I wonder if
>> Historic Scotland might be changing frequently or exist in several
>> versions.
>>
>> Regards,
>>
>> Markus
>>
>>
>>>
>>> On Tue, Jun 2, 2015 at 10:26 AM Magnus Manske
>>> mailto:magnusman...@googlemail.com>> wrote:
>>>
>>> Update: There appear to be quite a few items with duplicate Scotland
>>> IDs (not all of them may be erroneous!):
>>> http://wdq.wmflabs.org/stats?action=doublestring&prop=709
>>>
>>> On Tue, Jun 2, 2015 at 10:23 AM Magnus Manske
>>> mailto:magnusman...@googlemail.com>>
>>> wrote:
>>>
>>> I created (some/most of) these items as part of the Wiki Loves
>>> Monuments UK 2014 drive, to run the campaign from Wikidata
>>> rather than from a bespoke database. This allows the community
>>> (TM) to maintain the data, rather than one poor sod (e.g.,
>>> myself) having to frantically update all of it every year ;-)
>>>
>>> "Consumer" tool is here:
>>> https://tools.wmflabs.org/wlmuk/index_wd.html
>>>
>>> These are based on "official" data from National Heritage,
>>> provided to me via Wikimedia UK. Grade A (or Grade