[Wikidata] Re: State of the (Wiki)data

2022-11-02 Thread Markus Krötzsch

Dear all,

Thanks, Romaine, for this detailed and careful analysis of the 
situation. I think much of this is spot-on. I think one of the main 
insights here is that we need more uniformity. Wikidata in many places 
is still used like some exotic "structured" format for entering plain 
texts, which make sense to human readers but prevent or confuse 
automated usage. The key is to "see" collections of items rather than 
single pages.


It seems Wikidata would need more stakeholder communities for specific 
areas (say sports events) to oversee and guide the modeling of the items 
of this kind. We need more WikiProjects.


Regarding the question whether solutions need to be technical or social, 
I'd say both must go together. I also have often been disheartened by 
the sheer effort that it would require to add even the most obvious 
statements to a larger set of items. Geography is a good example: there 
are so many nearby places that share the same geo-administrative history 
(take a look at the country, P17, of Dresden, Q1731), yet it is 
practically impossible to add this to any significant number of the 
thousands of German cities ... Here, as in many of the cases Romaine 
has described, the technical limitations may smother necessary community 
activity. (The specific case might also be an example of something where 
an approach of "data sharing" is needed, i.e. a modeling paradigm that 
simply allows us to say "this place has the same history of P17 
statements as this other place"; but that's not the main topic of this 
post).


New tools may also enable and encourage communities to grow that have 
not formed in the past decade. One aspect here might be that it is 
difficult for communities to appreciate the result of their efforts. For 
example, it is very difficult to create a uniform appearance for a group 
of pages, if only because the order of statements (within a group for the 
same property) is so hard to change, and also because the pages are 
already very long. Even if one can achieve complete semantic uniformity, one 
will not currently have much opportunity to "see" this success. There 
are unsolved challenges here that cannot be compared with the relatively 
simple and small data that one can find in a typical Wikipedia Infobox. 
External developers and maybe even researchers could contribute here, 
but they would also benefit from the input and concrete ideas from 
WikiProjects (Romaine's email already had quite a number of directly 
implementable ideas in it ... this kind of constructive input is already 
half of the solution).


Cheers,

Markus


On 31/10/2022 23:40, Romaine Wiki wrote:
Yesterday it was 10 years since Wikidata was founded, and two weeks 
ago Wikidata reached 100 million items. This is a good 
moment to see what we have (and don't have), to look back a bit, and 
also to express some hope for the future.


The idea of describing this arose back in September, and since then I 
have done various analyses to get a picture. This, however, will not be 
a complete overview, as there are too many factors involved; it is just a 
general picture of what I came across.


(Spoiler: This e-mail gets more structure further below. :-p)

== Structured? ==

Wikidata, it is said, contains structured data. I think we need to be 
more precise about it: it is how the data is stored that is structured. 
And this structured data is _only_ present on an individual item. If we 
zoom out a little and view multiple items of a series, the data across 
those items is often missing, fragmented, differently organised, and 
sometimes even problematic. On a multi-item level (series level) it 
depends heavily on whether a user has done all the work to synchronise 
the various items with one another.


*Example:* I came across a series of items about a certain sports 
tournament with an edition organised each year for 50 years in a row. 
For P31 (instance of), on 5 items it was called an event, on 25 items it 
was called a sporting event, on 13 items a tournament, on some others 
a competition, and a few had no P31 at all. To be clear, each edition had 
the same setup and was for the same sport; everything was the same. The 
articles on Wikipedia are better structured!
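
A rough sketch of how the inconsistency in such a series could be made visible, assuming the P31 values of each edition have already been fetched into a plain dictionary. The item IDs, class labels, and function names below are made up for illustration and are not taken from Wikidata or from any existing tool.

```python
from collections import Counter


def tally_instance_of(series):
    """Count how often each P31 value occurs across the items of one series.

    `series` maps an item ID to its list of P31 values (empty if none).
    """
    counts = Counter()
    for p31_values in series.values():
        if not p31_values:
            counts["(no P31)"] += 1
        for value in p31_values:
            counts[value] += 1
    return counts


def is_uniform(series):
    """A series is uniform when every item carries the same set of P31 values."""
    distinct = {frozenset(values) for values in series.values()}
    return len(distinct) <= 1


# Hypothetical item IDs and class labels, for illustration only.
editions = {
    "edition-1971": ["event"],
    "edition-1972": ["sporting event"],
    "edition-1973": ["sporting event"],
    "edition-1974": ["tournament"],
    "edition-1975": [],
}

counts = tally_instance_of(editions)
print(counts.most_common())
print(is_uniform(editions))
```

A tally like this only shows that a series is inconsistent; deciding which P31 value is the right one remains a modeling decision for the community.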


This is just a simple series of items. Zooming out another level, the 
differences between series are huge, which makes the quality low.


How is a new item added? In the past ten years many items have been 
added with bots/tools based on the articles on Wikipedia. (Yes, I am 
ignoring other additions here.) In future, many items will still be created 
when an article on Wikipedia has been created. In the worst case, the 
user adds the sitelink and the item stays empty (practically useless!). 
A little better: the user adds P31/P279 (instance of/subclass of) 
(not useful, but it helps). Better still: other statements are also 
added (the item becomes useful). Better when a user checks one or two other 
items in a series. Much better when a user checks all items 
[Wikidata] Re: State of the (Wiki)data

2022-11-01 Thread Peter F. Patel-Schneider
I agree with all these criticisms of the information in Wikidata. There are  
quite a few important classes in Wikidata where structural data is missing, 
questionable, or incorrect.  Look at colors (instances of 
Q1075), where some colors are both instances and subclasses of color; or ships 
(instances of Q11446), where some ships are subclasses of ship; or the 
superclasses of geographic region (Q82794), which include set; or the 
instances of woman (Q467), of which there are only 28.


I believe that these structural problems in Wikidata are a major, probably the 
major, reason that Wikidata does not have considerably more uptake than it 
currently does.  Certainly every time I think of using Wikidata I have to 
think hard about what I need to do to ensure that the structural problems in 
Wikidata will not pose too much of a problem for my use.  (In most cases I 
come to the reluctant conclusion that they will.)



It's not so much that there are examples of bad structural data, it is that 
examples are so easy to find.  And it's not so much that the problems arise 
from bad policies, it is that there are no enforced policies.  And it's not 
even so much that these are unknown problems, as most of them have been 
previously reported.


It is for the above reasons that I believe that lack of tool support is not 
the major driver of the problems, and certainly tools that can only point out 
problems are not going to be a significant help in solving the problems.  
Instead I believe that what is driving the structural problems with Wikidata 
is that there is insufficient effort paid by the Wikidata community to 
identify and implement fixes for the structural problems.  Tool support is 
important, I agree, but without people in the Wikidata community putting a 
higher priority on fixing data in Wikidata than on adding even more data to 
Wikidata, structural problems will continue.


I also feel that it does very little good to ask people who are adding new 
data to Wikidata to only create data with good structure when there are so many 
existing problems.  Instead the existing problems first need to be fixed up.  
This will both show that the Wikidata community cares about good structure and 
show people who are adding new data how new data should be added instead of 
the current situation which in too many cases provides examples of how not to 
structure data.  Consider a tool that retrieves items that are similar to an 
item being added.  If this comparison item has bad structuring nearby it is 
very likely that the new item will either be given similar structuring or be 
linked to the existing bad structuring.




As far as labels, descriptions, and aliases go I agree that the current 
situation is poor.  But what I believe is missing most is enough description 
that the intent of an item, particularly a class, can be correctly 
determined.  I often end up with only a poor idea of what items should be an 
instance of a class, particularly when considering several classes at once.  
The various geographic classes are a prime example here for me.  In my view 
much of the natural language information associated with Wikidata items should 
be tagged with the English Wikipedia multiple issues template.




Queries that show the above problems:

# Colors: items that are both an instance (P31) of color (Q1075)
# and linked to it by a subclass (P279*) path.
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q1075.
  ?item wdt:P279* wd:Q1075.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

# Ships: the same check for ship (Q11446).
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q11446.
  ?item wdt:P279* wd:Q11446.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

# All superclasses of geographic region (Q82794).
SELECT ?item ?itemLabel WHERE {
  wd:Q82794 wdt:P279* ?item .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

# All instances of woman (Q467), including instances of its subclasses.
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31/wdt:P279* wd:Q467.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
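
For readers who want to see what the first two queries test without running them against the query service, here is a minimal sketch of the same check on a small in-memory graph. It assumes the P31 and P279 links are already available as plain dictionaries; the identifiers and function names are invented for illustration and the toy data is not real Wikidata content.

```python
def superclasses(item, subclass_of):
    """Return `item` plus every class reachable via zero or more P279 steps."""
    seen = {item}
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def instance_and_subclass(target, instance_of, subclass_of):
    """Items that are an instance of `target` and also reach it via P279*."""
    return sorted(
        item
        for item, classes in instance_of.items()
        if target in classes and target in superclasses(item, subclass_of)
    )


# Toy data with made-up identifiers; not real Wikidata content.
instance_of = {"red": ["color"], "crimson": ["color"], "loud": ["sound property"]}
subclass_of = {"crimson": ["red"], "red": ["color"]}

suspects = instance_and_subclass("color", instance_of, subclass_of)
print(suspects)
```

As in the queries, the path allows zero steps, so the target class itself would be reported if it were marked as an instance of itself.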


Peter F. Patel-Schneider

___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/GERAOWK3O56Z2YY4KHGZO4IGCXXXZK32/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: State of the (Wiki)data

2022-11-01 Thread Thad Guidry
Reading through all this carefully and taking notes along the way, it
appeared to me that ShEx (and better, easier tooling for it) could help with
about 50% of your future wants/needs.

Great thoughts and thanks for sharing!
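
To give a feel for the kind of check a shape could express, here is a rough Python stand-in. It is not ShEx syntax and not an existing tool. It uses the building example from the quoted mail below (country, located in the administrative territorial entity, coordinates); the property IDs P131 and P625, the class key "building", and the function name are assumptions added for this sketch.

```python
# Expected properties per class: a hand-written stand-in for a shape.
# P17 = country, P131 = located in the administrative territorial entity,
# P625 = coordinate location. The key "building" is a plain label chosen
# for this sketch, not a Wikidata ID.
EXPECTED = {"building": ["P17", "P131", "P625"]}


def missing_properties(item_class, statements):
    """List the expected properties of `item_class` that `statements` lacks."""
    return [p for p in EXPECTED.get(item_class, []) if p not in statements]


# Hypothetical item that so far only has a country statement (Q183 = Germany).
statements = {"P17": ["Q183"]}

missing = missing_properties("building", statements)
print(missing)
```

A real shape would also constrain value types and cardinalities; this sketch only checks for presence.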

On Tue, Nov 1, 2022 at 6:41 AM Romaine Wiki wrote:

> Yesterday it was 10 years since Wikidata was founded, and two weeks ago
> Wikidata reached 100 million items. This is a good moment to see what we
> have (and don't have), to look back a bit, and also to express some hope
> for the future.
>
> The idea of describing this arose back in September, and since then I
> have done various analyses to get a picture. This, however, will not be a
> complete overview, as there are too many factors involved; it is just a
> general picture of what I came across.
>
> (Spoiler: This e-mail gets more structure further below. :-p)
>
> == Structured? ==
>
> Wikidata, it is said, contains structured data. I think we need to be
> more precise about it: it is how the data is stored that is structured. And
> this structured data is *only* present on an individual item. If we zoom
> out a little and view multiple items of a series, the data across those
> items is often missing, fragmented, differently organised, and sometimes
> even problematic. On a multi-item level (series level) it depends heavily
> on whether a user has done all the work to synchronise the various items
> with one another.
>
> *Example:* I came across a series of items about a certain sports
> tournament with an edition organised each year for 50 years in a row. For
> P31 (instance of), on 5 items it was called an event, on 25 items it was
> called a sporting event, on 13 items a tournament, on some others a
> competition, and a few had no P31 at all. To be clear, each edition had the
> same setup and was for the same sport; everything was the same. The
> articles on Wikipedia are better structured!
>
> This is just a simple series of items. Zooming out another level, the
> differences between series are huge, which makes the quality low.
>
> How is a new item added? In the past ten years many items have been added
> with bots/tools based on the articles on Wikipedia. (Yes, I am ignoring
> other additions here.) In future, many items will still be created when an
> article on Wikipedia has been created. In the worst case, the user adds the
> sitelink and the item stays empty (practically useless!). A little
> better: the user adds P31/P279 (instance of/subclass of) (not useful, but
> it helps). Better still: other statements are also added (the item
> becomes useful). Better when a user checks one or two other items in a
> series. Much better when a user checks all items of the row of subjects.
> And fantastic when a user checks all items in a series and in other series.
>
> Realistic for most new items? No, this is way too much effort. At the same
> time, to get quality data, it is needed.
>
> *Example:* About a month ago there were 13 000 items with a sitelink to
> the Dutch Wikipedia without the basic statements P31/P279. This is just one
> language version, we have hundreds of wikis!
>
> Some time after a new article has been written, users use a bot/tool
> to mass import new articles from Wikipedia to Wikidata with few or no
> statements. We should be happy that they do this work, but these items are
> largely empty and do not contain useful/needed data. Also many duplicates
> are created this way. We need to go to the source and find a solution
> there, re-thinking the workflow, otherwise we keep mopping with the tap
> open.
>
> *Needed for the future:* a "new article to Wikidata wizard". I imagine
> that when a user has finished writing an article, they click on Publish
> page. As soon as the page is saved the user gets a pop-up dialogue. The
> user is first asked (in the dialogue) to search in Wikidata to see if
> an item already exists about this subject. With a completely new subject or
> empty item, the second step is that the dialogue suggests (based on the
> published article) a few statements the user can click and confirm. Most
> new articles are about subjects that are part of some sort of series or
> about a subject with a default set of properties we expect to be always
> present (like a building: country, located in the administrative
> territorial entity and coordinates).
>
> I think we can be more precise about what Wikidata contains: it contains
> chaotic data in a structured way, which is often neither added nor
> maintained in a structured manner.
>
> To get more quality, we not only must have the data structured on items
> and among items, but also the way we think about working with the data
> needs more structure. We currently work with individual items, and without
> an integral perspective on the data: we have no overview.
>
>
> == Wikidata gives no overview ==
>
> I have sometimes heard users say that Wikidata can provide an overview. That is
> however not true. Wikidata does not give an overview!  Wikidata can't give
> an