Reading through all this carefully and taking notes along the way, it appeared to me that ShEx (and better, easier tooling for it) could help with about 50% of your future wants/needs.
Great thoughts and thanks for sharing!

On Tue, Nov 1, 2022 at 6:41 AM Romaine Wiki <[email protected]> wrote:

> Yesterday it was 10 years ago that Wikidata was founded, and two weeks ago Wikidata reached 100 million items. This is a good moment to see what we have (and don't have), to look back a bit, and also to express some hope for the future.
>
> The idea to describe this already started in September, and since then I have done various analyses to get a picture. This, however, will not be a complete overview as there are too many factors involved, just a general picture of what I came across.
>
> (Spoiler: This e-mail gets more structure further below. :-p)
>
> == Structured? ==
>
> It is said that Wikidata contains structured data. I think we need to be more precise about that: it is how the data is stored that is structured. And this structured data is *only* present on an individual item. If we zoom out a little bit and view multiple items of a series, the data among those items is often missing, fragmented, differently organised, and sometimes even problematic. On a multi-item level (series level) it highly depends on whether a user has done all the work to synchronise the various items or not.
>
> *Example:* I came across a series of items about a certain sports tournament with an edition organised each year for 50 years in a row. For P31 (instance of), on 5 items it was called an event, on 25 items a sporting event, on 13 items a tournament, on some others a competition, and a few had no P31. To be clear, each edition had the same setup, was for the same sport, everything the same. The articles on Wikipedia are better structured!
>
> This is just a simple series of items. Zooming out another level, the differences between series are huge, which makes the quality low.
>
> How is a new item added? In the past ten years many items have been added with bots/tools based on the articles on Wikipedia.
> (Yes, I am ignoring other additions here.) In the future, many items will still be created when an article on Wikipedia has been created. In the worst case, the user adds the sitelink and the item stays empty (practically useless!). A little better, the user adds P31/P279 (instance of/subclass of) (not useful, but it helps). Better still, other statements are added too (the item becomes useful). Better when a user checks one or two other items in a series. Much better when a user checks all items in that row of subjects. And fantastic when a user checks all items in a series and in other series.
>
> Realistic for most new items? No, this is way too much effort. At the same time, it is needed to get quality data.
>
> *Example:* About a month ago there were 13 000 items with a sitelink to the Dutch Wikipedia without the basic statements P31/P279. This is just one language version; we have hundreds of wikis!
>
> Some time after a new article has been written, users use a bot/tool to mass import new articles from Wikipedia to Wikidata with zero or few statements. We should be happy that they do this work, but these items are largely empty and do not contain useful/needed data. Many duplicates are also created this way. We need to go to the source and find a solution there, re-thinking the workflow, otherwise we keep mopping with the tap open.
>
> *Needed for the future:* a "new article to Wikidata wizard". I imagine that when a user is ready with writing an article, he clicks on Publish page. As soon as the page is saved the user gets a pop-up dialogue. The user is first asked (in the dialogue) to search Wikidata to see if an item about this subject already exists. With a completely new subject or an empty item, the second step is that the dialogue suggests (based on the published article) a few statements the user can click and confirm.
> Most new articles are about subjects that are part of some sort of series, or about a subject with a default set of properties we expect to be always present (like a building: country, located in the administrative territorial entity, and coordinates).
>
> I think we can be more precise about what Wikidata contains: it contains chaotic data in a structured way, which is often neither structurally added nor maintained.
>
> To get more quality, we must not only have the data structured on items and among items; the way we think about working with the data also needs more structure. We currently work with individual items, without an integral perspective on the data: we have no overview.
>
>
> == Wikidata gives no overview ==
>
> I have sometimes heard users say that Wikidata can provide an overview. That is, however, not true. Wikidata does not give an overview! Wikidata can't give an overview itself, but a tool can create an overview using data from Wikidata.
>
> To get more quality on Wikidata, more overview is needed. Overview of what is missing but should have been added on every item of a series. Overview of what unexpected use of properties can be found in a series of items. The tools that currently exist are especially good at detecting what data has been added, but not what data is missing or is weird for this type of item.
>
> *Needed for the future:* a tool "get me more like this item", but I prefer to call it a "smart tool". When looking at an item, I often find myself wondering what statements the other items of this series have. If a series contains 50(+) items, I have to open every single item to see if anything weird is going on or anything is missing in these items. I wish I could press a button "get me more like this".
> The tool then shows a full series of items with the same label (like only the year changes) (but also takes into account the labels in multiple languages at the same time) and/or with the same description and/or with the same/similar properties. The tool gives suggestions for what to include, but it is also possible to indicate that the tool should ignore certain things. In this way I can easily find a certain sports tournament with 50 editions over the past 50 years. It then also includes those editions of that tournament that have no article (on WP) in my native language (and thus no label in my language), but do have one in, for example, the Italian WP. The tool shows all the properties added, without my having to indicate which properties should be shown, and can show the labels and descriptions in multiple languages.
>
>
> == Labels, descriptions and aliases ==
>
> If I have to describe one of the main things I do on Wikidata, it would be fixing language. The number one thing to fix is capitals -> lower case. I click edit, change the capital of the label, change the capital of the description (if there is only one), often change the capitals of all the aliases, and click save. This does not sound like much work, but when visiting 100 items it becomes a lot of work. And this was just one language; often I fix it for English and French too. Can't this be made easier?
>
> *Needed for the future:* a tool with which I can fix capitals in one click. In 99.9% of the cases they are capitals that need to be changed to lower case. Especially ligatures take a lot of work. If someone works on this, take into account the ligatures <https://en.wikipedia.org/wiki/Ligature_(writing)> and for Dutch also IJ -> ij.
>
> *Needed for the future:* a Wikidata game that can easily find items where capitals are used where it should be lower case.
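A one-click capital fix like the one wished for above could start from something as small as the sketch below. It is only an illustration, not an existing Wikidata tool: the function name `lowercase_initial` and its `language` parameter are made up here, and deciding whether a label is a proper noun that must keep its capital is still left to the editor.

```python
def lowercase_initial(text: str, language: str = "en") -> str:
    """Lower-case only the first letter of a label, description, or alias.

    Hypothetical helper for illustration; it does not decide whether the
    text is a proper noun that should keep its capital.
    """
    if not text:
        return text
    # Dutch treats "IJ" as a single letter, so "IJsbaan" becomes "ijsbaan".
    if language == "nl" and text.startswith("IJ"):
        return "ij" + text[2:]
    # str.lower() also covers single-character ligatures such as "Æ" or "Œ".
    return text[0].lower() + text[1:]


if __name__ == "__main__":
    print(lowercase_initial("Sporting event"))
    print(lowercase_initial("IJsbaan", language="nl"))
```

Keeping the Dutch rule behind the language code means the same call can be reused for English and French labels without touching a leading "IJ" there.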
>
> With many subjects the labels and descriptions are all right or all wrong when it comes to capitals. One group of subjects is more challenging, both in number and in the combination of lower case and capitals: taxa. Many labels were imported from Wikipedia. In Dutch, for example, the local names should be lower case and the scientific names with a capital. This is currently a big mess on thousands of items.
>
>
> Another thing I have to fix frequently is dots in descriptions. Apparently some users like to use a dot there, while they shouldn't. Finding the places where this has happened is very hard...
>
> *Needed for the future:* being able to run a query on the labels, descriptions and aliases. Many errors and issues can be found there and need to be solved, but finding them is not easy. I recently came across a series of items with a spelling error.
>
>
> Did you know that there are more than 20 places in the world that are called Amsterdam? How useful, then, is a description "building in Amsterdam"? Yes, a large number of users find it too much work to add the country where a certain item is located.
>
> *Needed for the future:* a tool/query with which I can quickly get an overview of all the descriptions that don't contain a country.
>
> *Needed for the future:* a Wikidata game that gives me descriptions without a country that should have one.
>
> We have arrived at useful labels and descriptions. A lot of work needs to be done in that field. Many subjects do not have a unique label as there are other subjects with the same name. To select the right item, a description is needed to clarify the context of that item.
>
> *Needed for the future:* a Wikidata game that can generate descriptions.
> For many items the description can simply be <subject =P31> in <location/administrative territorial entity =P131/P276>, <country>. At the same time this can be added in your local language as well as in English, so everyone knows what the topic is about. (Bonus: there are still items that do not contain a country, maybe something to be fixed right away?)
>
> A Wikidata game can help to find items with missing labels/languages, but it should be possible to simply query these.
>
> On the other hand, I have also come across items with many wrong descriptions, especially "Wikimedia category" and "Wikimedia disambiguation page". Sometimes this can't simply be reverted, causing a lot of manual labour. On a recent occasion it took 50 minutes <https://www.wikidata.org/w/index.php?title=Q89509298&action=history> to get the page saved!
>
> *Needed for the future:* a tool that instantly removes from one item all the labels of disambiguation pages, Wikimedia category or Wikimedia list article.
>
>
> Having at least a label in English is very welcome, otherwise there is no clue what Q1234567 is about. There are bots that add missing labels, including by copying the page names from Commons. The sitelinks on Commons are often Commons categories that are connected to items about that individual subject. The bots adding the missing labels sadly also copy the prefix Category: when entering the labels, which is often wrong. Simple solution: only add the Category: prefix if P31 has Wikimedia category as statement.
>
>
> I personally think that the biggest weakness of Wikidata is the missing labels, and in particular the missing labels in English. If an item has no label at all, it is basically useless. If an item only has a label in a local language (and not English), it can only be used in that local language, which is a minority of the world.
> At the same time, while in many countries most people also speak English next to their local language, in many other countries this is not the case and people don't understand English. This is a matter of accessibility and therefore it has priority.
>
> *Needed for the future:* a program to get for (almost) all items a label available in English + translations of this label in many local languages.
>
> *Needed for the future:* a minimal requirement for batch uploads that they contain at least a label in English.
>
> *Needed for the future:* a tool that helps with translations. There are many items with the same name. Currently we have to add a translation to each single item. A tool would be handy to find all items with the same name in English (like: Saint Servatius Church), and then to add a translation only once, which the tool adds to all the items.
>
> *Needed for the future:* a tool that can do transliteration. Transliteration is a huge barrier for the usability of the data, as many labels are only added in one script, while the user natively uses another script. This especially involves names.
>
> Especially smaller language communities have a hard time on Wikidata. A small language community means that only a very limited part of the (essential) items on Wikidata gets translated into their local language. At the same time, if you work on adding statements to Wikidata (while being a non-English speaker), you highly depend on translations being available in your own language. If something has no label in your local language -> it will not be found when searching with the local word -> you can't add a statement or you likely add a wrong statement.
>
> Without any statements an item is practically useless, so various users are searching for items without any statements in order to add them. While doing that, I recently came across a few items with only a sitelink to a Wikipedia language version.
> This language is not available in Google Translate or any other translation tool I could find, so nothing could be done with these items.
>
>
> == Statements ==
>
> Every item on Wikidata should have at least a statement instance of or subclass of (P31/P279) (or both), because these two properties define what the item is about. Without these properties, we are practically blind. It is great to see some of you are working on getting all the items to have these properties. (I recently completed that for all items with a sitelink to nlwiki: 23 373x -> 0x.) More help is needed for the many other items!
>
> *Needed for the future:* a Wikidata game that brings up items without P31/P279 and gives suggestion(s) to add.
>
> While doing the project of adding P31/P279, I noticed that various users still do not understand the difference between these two properties. This means that on various items users have added P31/P279 wrongly. We need to think about how we can find the items where that is the case and fix those. There is also a grey zone: a series of items for which it is not precisely clear which one it should be. Perhaps a project that can take care of those cases?
>
> In addition to P31/P279, each theme of items has a fixed set of basic statements that should always have been added. For example, with taxon as P31, also needed are scientific name (P225), taxon rank (P105), and parent taxon (P171). For example, with building as P31, also needed are country (P17), administrative territorial entity (P131), and coordinates (P625). This seems pretty obvious, but a recent large data import still forgot some basic statements, which still hasn't been fixed. The goal of adding data is that the data can be used. With some basic statements missing, the quality becomes too low. I think such data imports should not be allowed.
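The fixed sets of basic statements described above lend themselves to a simple completeness check. The sketch below is hypothetical: `BASIC_STATEMENTS` and `missing_basic_statements` are names invented for illustration, the classes are keyed by plain strings rather than Q-ids, and only the two sets named in the message (taxon and building) are included.

```python
from typing import Dict, Iterable, List, Set

# Basic properties per class, copied from the two examples in the message.
# Any further classes would have to be agreed on by the community.
BASIC_STATEMENTS: Dict[str, Set[str]] = {
    "taxon": {"P225", "P105", "P171"},    # scientific name, taxon rank, parent taxon
    "building": {"P17", "P131", "P625"},  # country, admin. entity, coordinates
}


def missing_basic_statements(item_class: str, present: Iterable[str]) -> List[str]:
    """Return the basic properties that an item of this class still lacks."""
    required = BASIC_STATEMENTS.get(item_class, set())
    # Sorting is plain string order, only to make the output stable.
    return sorted(required - set(present))


if __name__ == "__main__":
    # A building that only has a country so far.
    print(missing_basic_statements("building", ["P17"]))
```

A batch upload could run such a check before saving and hold back items for which the returned list is not empty.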
>
> *Needed for the future:* a program/project to get the basic statements present on (almost) all items.
>
> We then probably should also pay attention to the quality. For example, the administrative territorial entity (P131) should not be too generic. A village or a building should get as P131 the smallest possible administrative territorial entity (in many countries the municipality).
>
>
> With the various fixed sets of basic statements I estimate that it is possible to cover at least 90% of all the items (taxa, geographic features, people, structures, astronomical objects, publications, etc.). The remaining ones are harder, often more specialised. Those often have "term" or "concept" for P31. To make those items more usable, they need more statements that provide a better context. Then properties like "aspect of" (P1269) and "characterized by" (P1552) are needed.
>
>
> == Identifiers ==
>
> Not much can be said about identifiers: they do what they have to do. Even within Wikidata they help a lot, as a symbol is shown when the same identifier has been used on another item, which helps to resolve duplicates. Most identifiers seem to relate to sports, popular culture (music, movies, etc.) and monuments. In many other fields no properties for identifiers have been created yet.
>
> For a generic user, most work regarding identifiers is in finding out how to find the specific identifier on which website so it can be added to the item. I think we should pay more attention to that, so we can make it easier for users to add them. Another thing is that for the theme I am working on, it is not easy to see where identifiers are missing but likely do exist.
>
> *Needed for the future:* a tool that lists all potential items (in general or of a set of items/query) where an identifier is likely missing.
>
>
> If identifiers are added to items, an icon next to it often indicates which statements are missing on that same item. For example, if I add a monument identifier, it also indicates that, for example, a country (P17) should be added. A great help to get items more complete. At the same time identifiers are often forgotten. The other way round would be welcome too: when an item gets, for example, building as P31, it should also suggest adding a country (P17), an administrative territorial entity (P131), coordinates (P625), and perhaps more.
>
>
> == Other ==
>
> Besides the ones already mentioned there are some tools/software/issues that would make the work easier or need to be solved.
>
> *Needed for the future:* a tool that looks up all coordinates near certain coordinates. Like Special:Nearby, but for any given location.
>
> *Needed for the future:* better suggestions when adding statements. For example, when I added bridges (those things to cross a river), I got suggestions for properties related to astronomy. When an item has a Wikipedia article as sitelink, it would be great if a statement suggester would use the Wikipedia article to give suggestions. For example, why do I have to indicate the country manually (again) if this has already been indicated twice in the Wikipedia article?
>
>
>
> Ten years ago Wikidata started. Those years passed by quickly. Together we have put so much work into it, with a great result as outcome. But we are not ready yet. For the next ten years I expect our main focus to be improving the quality.
>
>
> _______________________________________________
> Wikidata mailing list -- [email protected]
> Public archives at https://lists.wikimedia.org/hyperkitty/list/[email protected]/message/QNOSK7Z6VN7SBUV42A26IQZ4U72LFEAQ/
> To unsubscribe send an email to [email protected]
>

-- 
Thad
https://www.linkedin.com/in/thadguidry/
https://calendly.com/thadguidry/
