[Wikidata] Re: State of the (Wiki)data
Dear all,

Thanks, Romaine, for this detailed and careful analysis of the situation. I think much of this is spot-on.

I think one of the main insights here is that we need more uniformity. Wikidata in many places is still used like some exotic "structured" format for entering plain texts, which make sense to human readers but prevent or confuse automated usage. The key is to "see" collections of items rather than single pages. It seems Wikidata would need more stakeholder communities for specific areas (say sports events) to oversee and guide the modeling of the items of this kind. We need more WikiProjects.

Regarding the question whether solutions need to be technical or social, I'd say both must go together. I have also often been disheartened by the sheer effort that it would require to add even the most obvious statements to a larger set of items. Geography is a good example: there are so many nearby places that share the same geo-administrative history (take a look at the country, P17, of Dresden, Q1731), yet it is practically impossible to add this to any significant number of the thousands of German cities ... Here, as in many of the cases Romaine has described, the technical limitations may smother necessary community activity. (The specific case might also be an example of something where an approach of "data sharing" is needed, i.e. a modeling paradigm that simply allows us to say "this place has the same history of P17 statements as this other place"; but that's not the main topic of this post).

New tools may also enable and encourage communities to grow that have not formed in the past decade. One aspect here might be that it is difficult for communities to appreciate the result of their efforts. For example, it is very difficult to create a uniform appearance for a group of pages, if only because the order of statements (in a group of the same property) is so hard to change, and also because the pages are already very long.
Even if one can achieve complete semantic uniformity, one will not currently have much opportunity to "see" this success. There are unsolved challenges here that cannot be compared with the relatively simple and small data that one can find in a typical Wikipedia Infobox. External developers and maybe even researchers could contribute here, but they would also benefit from the input and concrete ideas from WikiProjects (Romaine's email already had quite a number of directly implementable ideas in it ... this kind of constructive input is already half of the solution).

Cheers,
Markus

On 31/10/2022 23:40, Romaine Wiki wrote:

Yesterday it was 10 years ago when Wikidata was founded and two weeks ago Wikidata reached the amount of 100 million items. This is a good moment to see what we have (and don't have), to look a bit back, and also to express some hope for the future.

The idea to describe this already started in September and since then I have done various analyses to get a picture. This, however, will not be a complete overview as there are too many factors involved, just a general picture of what I came across.

(Spoiler: This e-mail gets more structure further below. :-p)

== Structured? ==

Wikidata, it is said, contains structured data. I think we need to be more precise with it: it is how the data is stored that is structured. And this structured data is _only_ present on an individual item. If we zoom out a little bit, and view multiple items of a series, among items the data is often missing, fragmented, differently organised, and sometimes even problematic. On a multi-item level (series level) it highly depends on whether a user has done all the work to synchronise the various items all together or not.

*Example:* I came across a series of items about a certain sports tournament with an edition organised each year for 50 years in a row.
For P31 (instance of), on 5 items it was called an event, on 25 items it was called a sporting event, on 13 items a tournament, on some others a competition, and a few had no P31 at all. To be clear, each edition had the same setup, was for the same sport, everything the same. The articles on Wikipedia are better structured!

This is just a simple series of items. Zooming out another level, the differences between series are huge, which makes the quality low.

How is a new item added? In the past ten years many items have been added with bots/tools based on the articles on Wikipedia. (Yes, I ignore other additions here.) In future many items will still be created when an article on Wikipedia has been created. In the worst case, the user adds the sitelink and the item stays empty (practically useless!). A little bit better, the user adds P31/P279 (instance of/subclass of) (not useful, but it helps). Better still, other statements are also added (an item becomes useful). Better when a user checks one/two other items in a series. Much better when a user checks all items
[Wikidata] Re: State of the (Wiki)data
I agree with all these criticisms of the information in Wikidata.

There are quite a few important classes in Wikidata where there is missing, questionable, or incorrect structural data. Look at colors (instances of Q1075), where some colors are both instances and subclasses of color; or ships (instances of Q11446), where some ships are subclasses of ship; or the superclasses of geographic region (Q82794), which include set; or the instances of woman (Q467), of which there are only 28.

I believe that these structural problems in Wikidata are a major, probably the major, reason that Wikidata does not have considerably more uptake than it currently does. Certainly every time I think of using Wikidata I have to think hard about what I need to do to ensure that the structural problems in Wikidata will not pose too much of a problem for my use. (In most cases I come to the reluctant conclusion that they will.)

It's not so much that there are examples of bad structural data, it is that examples are so easy to find. And it's not so much that the problems arise from bad policies, it is that there are no enforced policies. And it's not even so much that these are unknown problems, as most of them have been previously reported.

It is for the above reasons that I believe that lack of tool support is not the major driver of the problems, and certainly tools that can only point out problems are not going to be a significant help in solving the problems. Instead I believe that what is driving the structural problems with Wikidata is that insufficient effort is made by the Wikidata community to identify and implement fixes for the structural problems. Tool support is important, I agree, but without people in the Wikidata community putting a higher priority on fixing data in Wikidata than even on adding more data to Wikidata, structural problems will continue.
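To make the first kind of problem concrete (an item that is both an instance and a subclass of the same class), here is a minimal Python sketch that runs the check on a handful of in-memory statements rather than on live data. It is only an illustration under stated assumptions: the sample triples and the `Q_*` item names are invented placeholders, and the function names are hypothetical. Only P31, P279, and Q1075 are taken from the message.

```python
from collections import defaultdict

P31 = "P31"    # instance of
P279 = "P279"  # subclass of


def superclasses(item, subclass_of):
    """Return every class reachable from item via one or more P279 steps."""
    seen = set()
    stack = list(subclass_of.get(item, ()))
    while stack:
        cls = stack.pop()
        if cls not in seen:  # the seen set also protects against P279 cycles
            seen.add(cls)
            stack.extend(subclass_of.get(cls, ()))
    return seen


def instance_and_subclass(triples, target):
    """Items that are an instance of target and also a transitive subclass of it."""
    instance_of = defaultdict(set)
    subclass_of = defaultdict(set)
    for subject, prop, value in triples:
        if prop == P31:
            instance_of[subject].add(value)
        elif prop == P279:
            subclass_of[subject].add(value)
    return sorted(
        item
        for item, classes in instance_of.items()
        if target in classes and target in superclasses(item, subclass_of)
    )


# Hypothetical sample statements; the Q_* names are placeholders, not real items.
SAMPLE = [
    ("Q_red", P31, "Q1075"),
    ("Q_red", P279, "Q_warm"),
    ("Q_warm", P279, "Q1075"),
    ("Q_blue", P31, "Q1075"),
    ("Q_pastel", P279, "Q1075"),
]

print(instance_and_subclass(SAMPLE, "Q1075"))  # prints ['Q_red']
```

In this made-up sample only `Q_red` is flagged: `Q_blue` is a plain instance and `Q_pastel` a plain subclass. The sketch follows one or more P279 steps, so the class itself is never reported; a SPARQL path such as `wdt:P279*` also allows zero steps.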
I also feel that it does very little good to ask people who are adding new data to Wikidata to only create data with good structure when there are so many existing problems. Instead the existing problems first need to be fixed up. This will both show that the Wikidata community cares about good structure and show people who are adding new data how new data should be added, instead of the current situation, which in too many cases provides examples of how not to structure data. Consider a tool that retrieves items that are similar to an item being added. If this comparison item has bad structuring nearby it is very likely that the new item will be either given similar structuring or linked to the existing bad structuring.

As far as labels, descriptions, and aliases go I agree that the current situation is poor. But what I believe is missing most is enough description that the intent of an item, particularly a class, can be correctly determined. I often end up with only a poor idea of what items should be an instance of a class, particularly when considering several classes at once. The various geographic classes are a prime example here for me. In my view much of the natural language information associated with Wikidata items should be tagged with the English Wikipedia multiple issues template.

Queries that show the above problems:

    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q1075.
      ?item wdt:P279* wd:Q1075.
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }

    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q11446.
      ?item wdt:P279* wd:Q11446.
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }

    SELECT ?item ?itemLabel WHERE {
      wd:Q82794 wdt:P279* ?item .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }

    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31/wdt:P279* wd:Q467.
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }

Peter F. Patel-Schneider

___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/GERAOWK3O56Z2YY4KHGZO4IGCXXXZK32/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org
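The first two queries in the message above differ only in the class identifier, so the same check can be generated for any class. The following Python sketch builds that query text; it is an illustration, not part of the original message. The function name `instance_and_subclass_query`, the template constant, and the input validation are assumptions introduced here, and the sketch only produces a string: sending it to a query service is deliberately left out.

```python
import re

# A QID is the letter Q followed by a positive integer, e.g. Q1075.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")
# Conservative check for language codes such as "en" or "pt-br" (an assumption).
LANG_PATTERN = re.compile(r"[a-z]{2,3}(-[a-z0-9]+)*")

# Same shape as the first two queries in the message, with the class left open.
QUERY_TEMPLATE = """SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P31 wd:{qid} .
  ?item wdt:P279* wd:{qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang}". }}
}}"""


def instance_and_subclass_query(qid, lang="en"):
    """Build the query text for items that are both instances and subclasses of qid."""
    if not QID_PATTERN.fullmatch(qid):
        raise ValueError(f"not a valid QID: {qid!r}")
    if not LANG_PATTERN.fullmatch(lang):
        raise ValueError(f"not a valid language code: {lang!r}")
    return QUERY_TEMPLATE.format(qid=qid, lang=lang)


# Reproduces the colour query; pass "Q11446" for the ship query.
print(instance_and_subclass_query("Q1075"))
```

Validating the identifier before formatting keeps arbitrary text out of the generated query, which matters if the class comes from user input.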
[Wikidata] Re: State of the (Wiki)data
Reading through all this carefully and taking notes along the way, it appeared to me that ShEx (and better, easier tooling for it) could help with about 50% of your future wants/needs. Great thoughts and thanks for sharing!

On Tue, Nov 1, 2022 at 6:41 AM Romaine Wiki wrote:

> Yesterday it was 10 years ago when Wikidata was founded and two weeks ago Wikidata reached the amount of 100 million items. This is a good moment to see what we have (and don't have), to look a bit back, and also to express some hope for the future.
>
> The idea to describe this already started in September and since then I have done various analyses to get a picture. This, however, will not be a complete overview as there are too many factors involved, just a general picture of what I came across.
>
> (Spoiler: This e-mail gets more structure further below. :-p)
>
> == Structured? ==
>
> Wikidata, it is said, contains structured data. I think we need to be more precise with it: it is how the data is stored that is structured. And this structured data is *only* present on an individual item. If we zoom out a little bit, and view multiple items of a series, among items the data is often missing, fragmented, differently organised, and sometimes even problematic. On a multi-item level (series level) it highly depends on whether a user has done all the work to synchronise the various items all together or not.
>
> *Example:* I came across a series of items about a certain sports tournament with an edition organised each year for 50 years in a row. For P31 (instance of), on 5 items it was called an event, on 25 items it was called a sporting event, on 13 items a tournament, on some others a competition, and a few had no P31 at all. To be clear, each edition had the same setup, was for the same sport, everything the same. The articles on Wikipedia are better structured!
>
> This is just a simple series of items. Zooming out another level, the differences between series are huge, which makes the quality low.
>
> How is a new item added? In the past ten years many items have been added with bots/tools based on the articles on Wikipedia. (Yes, I ignore other additions here.) In future many items will still be created when an article on Wikipedia has been created. In the worst case, the user adds the sitelink and the item stays empty (practically useless!). A little bit better, the user adds P31/P279 (instance of/subclass of) (not useful, but it helps). Better still, other statements are also added (an item becomes useful). Better when a user checks one/two other items in a series. Much better when a user checks all items of the row of subjects. And fantastic when a user checks all items in a series and in other series.
>
> Realistic for most new items? No, this is way too much effort. At the same time, to get quality data, it is needed.
>
> *Example:* About a month ago there were 13 000 items with a sitelink to the Dutch Wikipedia without the basic statements P31/P279. This is just one language version, we have hundreds of wikis!
>
> Some time after a new article has been written, users use a bot/tool to mass import new articles from Wikipedia to Wikidata with zero/few statements. We should be happy that they do this work, but these items are largely empty and do not contain useful/needed data. Also many duplicates are created this way. We need to go to the source and find a solution there, re-thinking the workflow, otherwise we keep mopping with the tap open.
>
> *Needed for the future:* a "new article to Wikidata wizard". I imagine that when a user is ready with writing an article, he clicks on Publish page. As soon as the page is saved the user gets a pop-up dialogue. The user is first asked (in the dialogue) to search in Wikidata to see whether an item about this subject already exists. With a completely new subject or empty item, the second step is that the dialogue suggests (based on the published article) a few statements the user can click and confirm. Most new articles are about subjects that are part of some sort of series or about a subject with a default set of properties we expect to be always present (like a building: country, located in the administrative territorial entity and coordinates).
>
> I think we can be more precise about what Wikidata contains: it contains chaotic data in a structured way, which is often neither structurally added nor maintained.
>
> To get more quality, we not only must have the data structured on items and among items, but also the way we think about working with the data needs more structure. We currently work with individual items, and without an integral perspective on the data: we have no overview.
>
> == Wikidata gives no overview ==
>
> I sometimes heard users say that Wikidata can provide an overview. That is however not true. Wikidata does not give an overview! Wikidata can't give an