[Wikidata] Re: State of the (Wiki)data

Thad Guidry Mon, 31 Oct 2022 23:39:50 -0700

Reading through all this carefully and taking notes along the way, it
appeared to me that ShEx (and better, easier tooling for it) could help with
about 50% of your future wants/needs.


Great thoughts and thanks for sharing!

On Tue, Nov 1, 2022 at 6:41 AM Romaine Wiki <[email protected]> wrote:

> Yesterday it was 10 years since Wikidata was founded, and two weeks ago
> Wikidata reached 100 million items. This is a good moment to see what we
> have (and don't have), to look back a bit, and to express some hopes for
> the future.
>
> The idea of describing this arose back in September, and since then I
> have done various analyses to get a picture. This, however, will not be a
> complete overview, as there are too many factors involved, just a general
> picture of what I came across.
>
> (Spoiler: This e-mail gets more structure further below. :-p)
>
> == Structured? ==
>
> Wikidata, it is said, contains structured data. I think we need to be
> more precise about that: it is how the data is stored that is structured.
> And this structure is *only* present on an individual item. If we zoom
> out a little bit and view multiple items of a series, the data across
> items is often missing, fragmented, differently organised, and sometimes
> even problematic. On a multi-item level (series level) it depends heavily
> on whether a user has done all the work to synchronise the various items
> or not.
>
> *Example:* I came across a series of items about a certain sports
> tournament with an edition organised each year for 50 years in a row. For
> P31 (instance of), on 5 items it was called an event, on 25 items a
> sporting event, on 13 items a tournament, on some others a competition,
> and a few had no P31. To be clear, each edition had the same setup, was
> for the same sport, everything the same. The articles on Wikipedia are
> better structured!
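
The kind of inconsistency described in this example can be measured once the P31 values of a series are in hand. A minimal Python sketch, assuming the values have already been fetched as plain labels; the `editions` mapping, its ids, and both function names are made up for illustration:

```python
from collections import Counter

NO_P31 = "(no P31)"


def p31_distribution(series):
    """Count how often each P31 value occurs across a series of items.

    `series` maps an item id to its P31 label, or None when P31 is missing.
    """
    return Counter(value if value is not None else NO_P31
                   for value in series.values())


def is_consistent(series):
    """A series is consistent when every item carries the same P31 value."""
    counts = p31_distribution(series)
    return len(counts) == 1 and NO_P31 not in counts


# Hypothetical ids and values, for illustration only.
editions = {
    "edition-1971": "sporting event",
    "edition-1972": "tournament",
    "edition-1973": "sporting event",
    "edition-1974": None,
}
```

A report built on `p31_distribution(editions)` would show at a glance which value dominates and which items deviate from it.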
>
> This is just a simple series of items. Zooming out another level, the
> differences between series are huge, which makes the quality low.
>
> How is a new item added? In the past ten years many items have been added
> with bots/tools based on the articles on Wikipedia. (Yes, I am ignoring
> other additions here.) In the future many items will still be created when
> an article on Wikipedia has been created. In the worst case, the user adds
> the sitelink and the item stays empty (practically useless!). A little bit
> better, the user adds P31/P279 (instance of/subclass of) (not useful, but
> it helps). Better still, other statements are added too (an item becomes
> useful). Better when a user checks one or two other items in a series.
> Much better when a user checks all items in that series. And fantastic
> when a user checks all items in a series and in other series.
>
> Realistic for most new items? No, this is way too much effort. At the same
> time, to get quality data, it is needed.
>
> *Example:* About a month ago there were 13 000 items with a sitelink to
> the Dutch Wikipedia without the basic statements P31/P279. This is just one
> language version, we have hundreds of wikis!
>
> Some time after new articles have been written, users use a bot/tool to
> mass-import them from Wikipedia to Wikidata with few or no statements.
> We should be happy that they do this work, but these items are
> largely empty and do not contain useful/needed data. Also many duplicates
> are created this way. We need to go to the source and find a solution
> there, re-thinking the workflow, otherwise we keep mopping with the tap
> open.
>
> *Needed for the future:* a "new article to Wikidata wizard". I imagine
> that when a user has finished writing an article, they click on Publish
> page. As soon as the page is saved the user gets a pop-up dialogue. The
> user is first asked (in the dialogue) to search in Wikidata to see if
> an item already exists about this subject. With a completely new subject or
> empty item, the second step is that the dialogue suggests (based on the
> published article) a few statements the user can click and confirm. Most
> new articles are about subjects that are part of some sort of series or
> about a subject with a default set of properties we expect to be always
> present (like a building: country, located in the administrative
> territorial entity and coordinates).
>
> I think we can be more precise about what Wikidata contains: it contains
> chaotic data in a structured way, which is often neither systematically
> added nor maintained.
>
> To get more quality, we not only must have the data structured on items
> and among items, but also the way we think about working with the data
> needs more structure. We currently work with individual items, and without
> an integral perspective on the data: we have no overview.
>
>
> == Wikidata gives no overview ==
>
> I have sometimes heard users say that Wikidata can provide an overview.
> That is, however, not true. Wikidata does not give an overview! Wikidata
> can't give an overview itself, but a tool can create an overview using
> data from Wikidata.
>
> To get more quality on Wikidata, more overview is needed. An overview of
> what is missing but should have been added on every item of a series. An
> overview of what unexpected use of properties can be found in a series of
> items. Tools that currently exist are especially good at detecting what
> data has been added, but not what data is missing or is weird for this type
> of item.
>
> *Needed for the future:* a tool "get me more like this item", but I
> prefer to call it a "smart tool". When looking at an item, I often find
> myself wondering what statements the other items of this series have. If
> a series contains 50(+) items, I have to open every single item to see if
> anything weird is going on or anything is missing in these items. I wish I
> could press a button "get me more like this". The tool then shows a full
> series of items with the same label (like only the year changes) (but also
> takes into account the labels in multiple languages at the same time)
> and/or with the same description and/or with the same/similar properties.
> The tool gives suggestions for what to include, but it is also possible to
> indicate that the tool should ignore certain things. In this way I can
> easily find a certain sports tournament with 50 editions in the past 50
> years. It then also includes those editions of that tournament that have
> no article (on WP) in my native language (and thus no label in my
> language), but do have one in, for example, the Italian WP. The tool shows
> all the properties added, without my having to indicate which properties
> should be shown, and can show the labels and descriptions in multiple
> languages.
>
>
>
> == Labels, descriptions and aliases ==
>
> If I have to describe one of the main things I do on Wikidata it would be
> fixing language. The number one thing to fix is capitals -> lower case. I
> click edit, change the capital of the label, change the capital of the
> description (if it is only one), often change the capitals in all the
> aliases, and click save. This does not sound like much work, but when
> visiting 100 items it becomes a lot of work. And this was just one
> language; often I fix it for English and French too. Can't this be made
> easier?
>
> *Needed for the future:* a tool with which I can fix capitals in one
> click. In 99.9% of the cases they are capitals that need to be fixed to
> lower case. Especially ligatures take a lot of work. If someone works on
> this, take into account the ligatures
> <https://en.wikipedia.org/wiki/Ligature_(writing)> and for Dutch also IJ
> -> ij.
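
The core of such a capital-fixing tool could look roughly like the Python sketch below. It only shows the string handling, not any Wikidata editing; the function name `lower_initial` and the use of a plain language code are assumptions made for illustration.

```python
def lower_initial(label, lang):
    """Lower-case the first letter of a label or description.

    Dutch ("nl") treats the digraph IJ as a single letter, so both
    characters are lowered together. Single-character ligatures such as
    Œ or Æ are already handled by str.lower().
    """
    if not label:
        return label
    if lang == "nl" and label[:2].upper() == "IJ":
        return "ij" + label[2:]
    return label[0].lower() + label[1:]
```

For example, `lower_initial("IJsbeer", "nl")` gives `"ijsbeer"`, while the same input with `"en"` would only lower the first character. Deciding when a capital is actually wrong (proper names, scientific names) is deliberately left out of this sketch.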
>
> *Needed for the future:* a Wikidata game that can easily find items where
> capitals are used where lower case is needed.
>
> With many subjects the labels and descriptions are all right or all wrong
> when it comes to capitals. One group of subjects is more challenging, both
> in number and in its mix of lower case and capitals: taxa. Many labels got
> imported from Wikipedia. In Dutch, for example, the local names should be
> lower case and the scientific names should start with a capital. This is
> currently a big mess on thousands of items.
>
>
> Another thing I have to fix frequently is dots in descriptions.
> Apparently some users like to use a dot in there, while they shouldn't.
> Finding the places where this happened is very hard...
>
> *Needed for the future:* being able to run a query on the labels,
> descriptions and aliases. Many errors and issues can be found in them and
> need to get solved, but finding them is not easy. I recently came across a
> series of items with a spelling error.
>
>
> Did you know that there are more than 20 places in the world that are
> called Amsterdam? How useful is then a description "building in Amsterdam"?
> Yes, a large number of users find it too much work to add the country
> where a certain item is located.
>
> *Needed for the future:* a tool/query with which I can quickly get an
> overview of all the descriptions that don't contain a country.
>
> *Needed for the future:* a Wikidata game that gives me descriptions
> without country while they should have one.
>
> We have arrived at useful labels and descriptions. A lot of work needs to
> be done in that field. Many subjects do not have a unique label as there
> are other subjects with the same name. To select the right item, a
> description is needed to clarify the context of that item.
>
> *Needed for the future:* a Wikidata game that can generate descriptions.
> For many items the description can simply be <subject =P31> in
> <location/administrative territorial entity =P131/P276>, <country>. At the
> same time this can be added in your local language as well as in English, so
> everyone knows what the topic is about. (Bonus: there are still items that
> do not contain a country, maybe something to be fixed right away?)
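
The description pattern above is simple enough to sketch in a few lines of Python. This assumes the labels of the P31, P131/P276 and country values have already been looked up in the target language; `build_description` and its parameters are illustrative names, not an existing tool.

```python
def build_description(kind, place=None, country=None):
    """Compose "<kind> in <place>, <country>" from already-resolved labels.

    `kind` is the label of the P31 value and `place` the label of the
    P131/P276 value. Parts that are missing are simply left out.
    """
    where = ", ".join(part for part in (place, country) if part)
    return f"{kind} in {where}" if where else kind
```

So `build_description("church", "Maastricht", "Netherlands")` yields `"church in Maastricht, Netherlands"`. A real generator would also need per-language grammar (articles, prepositions, word order), which this sketch ignores.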
>
> A Wikidata game can help to find items with missing labels/languages, but
> it should be possible to simply query these.
>
> On the other hand, I have also come across items with many wrong
> descriptions, especially "Wikimedia category" and "Wikimedia disambiguation
> page". Sometimes this can't be simply reverted causing a lot of manual
> labour. On a recent occasion it took 50 minutes
> <https://www.wikidata.org/w/index.php?title=Q89509298&action=history> to
> get the page saved!
>
> *Needed for the future:* a tool that instantly removes from an item all
> the labels of disambiguation pages, Wikimedia category or Wikimedia list
> article.
>
>
> Having at least a label in English is very welcome, otherwise there is no
> clue what Q1234567 is about. There are bots that add missing labels,
> including copying the page names from Commons. The sitelinks on Commons are
> often Commons categories that are connected to items about that individual
> subject. The bots adding the missing labels sadly also copy the prefix
> Category: when entering the labels, which is often wrong. Simple solution:
> Only add the Category: prefix if P31 has Wikimedia category as statement.
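
The proposed rule could be expressed as the following Python sketch. Q4167836 is the Wikidata item for "Wikimedia category"; the function name and the idea of passing the P31 values as a set of item ids are assumptions for illustration, not how any existing bot works.

```python
WIKIMEDIA_CATEGORY = "Q4167836"  # Wikidata item "Wikimedia category"
PREFIX = "Category:"


def label_from_commons_title(title, p31_values):
    """Derive a label from a Commons page title.

    The "Category:" prefix is kept only when the item itself is a
    Wikimedia category; otherwise it is stripped.
    """
    if title.startswith(PREFIX) and WIKIMEDIA_CATEGORY not in p31_values:
        return title[len(PREFIX):].strip()
    return title
```

With this, an item about a building that is linked to `Category:Saint Servatius Church` would get the label `Saint Servatius Church`, while a genuine category item keeps its prefix.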
>
>
> I personally think that the biggest weakness of Wikidata is the missing
> labels, and in particular the missing labels in English. If an item has
> no label at all, it is basically useless. If an item only has a label in
> a local language (and not English), it can only be used in that local
> language, which is spoken by a minority of the world. At the same time,
> while in many countries most people also speak English next to their
> local language, in many other countries this is not the case and people
> don't understand English. This is a matter of accessibility and therefore
> it has priority.
>
> *Needed for the future:* a program to get for (almost) all items a label
> available in English + translations of this label in many local languages.
>
> *Needed for the future:* a minimum requirement that batch uploads
> contain at least a label in English.
>
> *Needed for the future:* a tool that helps with translations. There are
> many items with the same name. Currently we have to add a translation to
> each single item. A tool would be handy to find all items with the same
> name in English (like: Saint Servatius Church), and then to be able to add
> a translation only once, which the tool adds to all the items.
>
> *Needed for the future:* a tool that can do transliteration. Missing
> transliteration is a huge barrier for the usability of the data, as many
> labels are only added in one script, while the user natively uses another
> script. This especially involves names.
>
> Especially smaller language communities have a hard time on Wikidata. A
> small language community means that only a very limited part of the
> (essential) items on Wikidata gets translated into their local language. At
> the same time, if you work on adding statements to Wikidata (while being a
> non-English speaker), you highly depend on translations being available in
> your own language. If something has no label in your local language -> it
> will not be found when searching with the local word -> you can't add a
> statement or you likely add a wrong statement.
>
> Without any statements an item is practically useless, so various users
> are searching for items without any statements to add them. While doing
> that, I recently came across a few items with only a sitelink to a
> Wikipedia language version. This language is not available in Google
> Translate or any other translation tool I could find, so nothing could be
> done with these items.
>
>
> == Statements ==
>
> Every item on Wikidata should have at least a statement instance of
> or subclass of (P31/P279) (or both), because these two properties define
> what the item is about. Without these properties, we are practically blind.
> It is great to see some of you are working on getting all the items to have
> these properties on them. (I recently completed that for all items with a
> sitelink to nlwiki: 23 373x -> 0x.) More help is needed for the many other
> items!
>
> *Needed for the future:* a Wikidata game that brings up items without
> P31/P279 and gives suggestion(s) to add.
>
> While doing the project of adding P31/P279, I noticed that various users
> still do not understand the difference between these two properties. This
> means that on various items users have added P31/P279 wrongly. We need to
> think about how we can find the items where that is the case and fix them.
> There is also a grey zone: series of items for which it is not precisely
> clear which of the two it should be. Perhaps a project that can take care
> of those cases?
>
> In addition to P31/P279, each theme of items has a fixed set of basic
> statements that should always have been added. For example, with taxon as
> P31, also needed are scientific name (P225), taxon rank (P105), and parent
> taxon (P171). With building as P31, also needed are country (P17),
> administrative territorial entity (P131), and coordinates (P625).
> This seems pretty obvious, but a recent large data import still forgot
> some basic statements, which still hasn't been fixed. The goal of adding
> data is that the data can be used. By having some basic statements missing,
> the quality becomes too low. I think such data imports should not be
> allowed.
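
A check for the fixed sets of basic statements could start from something like this Python sketch. The two sets are the ones named in the message above; the table keyed by a plain type name and the function `missing_basic_statements` are illustrative assumptions, not an existing Wikidata feature.

```python
# Basic statements expected per type of item, as listed in the message.
BASIC_STATEMENTS = {
    "taxon": ("P225", "P105", "P171"),    # scientific name, rank, parent taxon
    "building": ("P17", "P131", "P625"),  # country, admin. entity, coordinates
}


def missing_basic_statements(kind, present):
    """Return the expected properties that an item of type `kind` lacks.

    `present` is the set of property ids already used on the item.
    Unknown types have no expectations and yield an empty list.
    """
    return [prop for prop in BASIC_STATEMENTS.get(kind, ())
            if prop not in present]
```

Run before a batch import, a check like `missing_basic_statements("building", {"P31", "P17"})` would flag that P131 and P625 are still absent.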
>
> *Needed for the future:* a program/project to get for (almost) all items
> the basic statements present.
>
> We then probably should also pay attention to the quality. For example
> the administrative territorial entity (P131) should not be too generic. A
> village or a building should get as P131 the smallest possible
> administrative territorial entity (in many countries the municipality).
>
>
> With the various fixed sets of basic statements I estimate that it is
> possible to cover at least 90% of all the items (taxa, geographic features,
> people, structures, astronomical objects, publications, etc.). The
> remaining ones are harder, often more specialised. Those often have
> "term" or "concept" as P31. To make those items more usable they need
> to get more statements that provide a better context. Then properties like
> "aspect of" (P1269) and "characterized by" (P1552) are needed.
>
>
> == Identifiers ==
>
> Not much can be said about identifiers: they do what they have to do. Even
> within Wikidata they help a lot, as a symbol is shown when the same
> identifier has been used on another item, thus helping to resolve
> duplicates. Most identifiers seem to relate to sports, popular culture
> (music, movies, etc.) and monuments. In many other fields no properties
> for identifiers have been created yet.
>
> For a generic user, most work regarding identifiers is in finding out how
> to find the specific identifier on which website so it can be added to the
> item. I think we should pay more attention to this so we can make it
> easier for users to add them. Another thing is that for the theme I am
> working on, it is not easy to see where identifiers are missing but do
> likely exist.
>
> *Needed for the future:* a tool that lists all potential items (in
> general or of a set of items/query) where an identifier likely is missing.
>
>
> If identifiers are added to items, an icon next to it often indicates
> which statements are missing on that same item. For example, if I add a
> monument identifier, it also indicates that, for example, a country (P17)
> still has to be added. A great help to get items more complete. At the same
> time identifiers are often forgotten. The other way round would be welcome
> too: when an item gets, for example, building as P31, it should also suggest to
> add a country (P17), an administrative territorial entity (P131),
> coordinates (P625), and perhaps more.
>
>
> == Other ==
>
> Besides the ones already mentioned there are some tools/software/issues
> that would make the work easier or need to be solved.
>
> *Needed for the future:* a tool that looks up all coordinates near
> certain coordinates. Like Special:Nearby, but for any given location.
>
> *Needed for the future:* better suggestions when adding statements. For
> example, when I add bridges (those things to cross a river), I get
> suggestions for properties related to astronomy. When an item has a
> Wikipedia article as sitelink, it would be great if a statement suggester
> would use the Wikipedia article to give suggestions. For example, why do I
> have to indicate manually the country (again) if this already has been
> indicated twice in the Wikipedia article?
>
>
>
> Ten years ago Wikidata started. Those years passed by quickly. All of us
> together have put so much work into it, with a great result. But we
> are not ready yet. For the next ten years I expect our main focus to be
> improving the quality.
>
>
> _______________________________________________
> Wikidata mailing list -- [email protected]
> Public archives at
> https://lists.wikimedia.org/hyperkitty/list/[email protected]/message/QNOSK7Z6VN7SBUV42A26IQZ4U72LFEAQ/
> To unsubscribe send an email to [email protected]
>
-- 
Thad
https://www.linkedin.com/in/thadguidry/
https://calendly.com/thadguidry/

_______________________________________________
Wikidata mailing list -- [email protected]
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/[email protected]/message/HAVAELSUVW5ZAROICKJEM6LYTNPU4WD6/
To unsubscribe send an email to [email protected]
