Jheald added a subscriber: Jheald.
Jheald added a comment.

I've bent Lydia's ear a couple of times in the past on this, once in Amsterdam and once in Paris on the way back from the boat reception.

Categories have developed for a reason. There's real value in grouping content (whether articles or Commons images) into groups of a human-manageable size, say 20 to 200 items, with the groups arranged in a curated hierarchical structure. IMO the category view may be particularly valuable for images, where it is really useful to be able to scroll down a group of about that size on a particular topic. But it goes for articles too, to find related material: there is value in being able to see a group of possibly related content of that size together. Too big a group (too little specificity) and it's overwhelming; too small a group (too much specificity) and it becomes too 'bitty': you don't see enough options to find the article you want, or to get an idea of the level of context and coverage all in one place, without it being broken into endless little bits. So there is value in the category system, which has tended to refine the degree of specificity to a particular sweet spot -- an appropriately sized group, not too big, not too small -- to be of most value to a human reader.

There's also great value in the knowledge that is stored in that hierarchical structure -- you won't go to a conference with researchers who have used Wikipedia without there being at least someone who has mined our category system for groups and relationships. And we're using it ourselves for Wikidata, as one of the key sources people mine to systematically give P31 values and key properties to items, far too many of which still lack them.

But the category system currently has great weaknesses too.

First, because addition is manual, inclusion and coverage can be haphazard; and because the structure is organic and somewhat arbitrary, even finding the correct category to put something in can be unpredictable, time-consuming and onerous.

Second, from an information-mining perspective, there is a difficulty because of a lack of transitivity: if A is a member of B, and B is a member of C, it does not necessarily follow that A matches the inclusion criteria of C. As a result, a downward exploration of the hierarchy doesn't have to go very far before category contents wheel off in all manner of strange and unexpected directions, quite incompatible with what one originally set out to harvest.

These are general problems of the category system, as applicable to Wikipedia as to Commons. So perhaps I should not be adding this to a Commons-specific Phabricator item. I tend to agree with Lydia that structured items for Commons categories should mostly stay on Commons, attached to a particular Commons page, or however the Commons wikibase is structured, rather than on the main Wikidata. But I think the issues are the same for both -- IMO the Commons and Wikipedia category issues exactly parallel each other -- and so is the value of what Wikidata (or structured data) can bring: to preserve category-like views, but to make them actually work much better, both for humans and for machines. Since they are so parallel, in what follows I'll discuss things in terms of items on Wikidata and corresponding articles and categories on Wikipedias, but the translation to Commons should be straightforward.

So here goes. If we're going to make categories work better, the first thing to do is to work out what is in them at the moment, and document it.

It turns out that we actually already have the property to do it: P360 "is a list of", as for example demonstrated in action on Q15832361, List of women engineers <https://www.wikidata.org/wiki/Q15832361>. The syntax for P360 exactly mirrors the properties that items to be included in the list or category should have, as inclusion criteria -- and in particular highlights the P31 which defines what sort of fundamental object they are.

Filling out a P360 for each category solves the problem of transitivity: with explicit inclusion criteria, it is easy for a crawler to identify e.g. when the downward sequence of categories ceases to be ever more refined groups with e.g. P31=battle, and instead turns into a category about a specific battle, with its commanders etc. -- where the battle in question has become a property of the members, rather than their fundamental P31 being a battle.

Having P360 in place also means that (if desired) the category view could be auto-augmented with objects whose items match the criteria, even if they haven't got the category line in their wikitext. (Magnus's Reasonator already does this, to predict what items ought to be in list articles.) Allowing users to turn this on, on a per-category basis, would hugely help the problem of comprehensiveness. Category views would then be as comprehensive as the data on Wikidata (which is much easier to interrogate systematically, and to compute intersections for), without all the fiddly business of having to look up and add category names by hand.

Also, as the direct converse, one could generate a constraint-violation report for all items included in a category (on a particular wiki) that apparently did *not* match the inclusion criteria defined in its P360 (or P360s -- there might be alternative acceptable sets of criteria).

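
As a rough illustration of the three uses of P360 described above (pruning a crawl, auto-augmenting a view, and reporting violations), here is a minimal sketch in Python. It works on plain in-memory dictionaries rather than any real Wikidata API; all `Q_...` values, the `P_part_of` and `P_participant_in` keys, the item names and the function names are hypothetical placeholders, and "matching" is simplified to "has at least one of the wanted values for every listed property".

```python
from typing import Dict, List, Set

Statements = Dict[str, Set[str]]  # property id -> set of value ids
Criteria = List[Statements]       # one entry per alternative P360 statement


def matches(item: Statements, criteria: Criteria) -> bool:
    """True if the item satisfies every property of at least one alternative."""
    return any(
        all(item.get(prop, set()) & wanted for prop, wanted in alt.items())
        for alt in criteria
    )


def augmented_view(tagged: Set[str], items: Dict[str, Statements],
                   criteria: Criteria) -> Set[str]:
    """Pages tagged in wikitext plus every item whose statements match."""
    return tagged | {qid for qid, st in items.items() if matches(st, criteria)}


def violations(tagged: Set[str], items: Dict[str, Statements],
               criteria: Criteria) -> Set[str]:
    """Tagged pages whose items match none of the alternative criteria sets."""
    return {qid for qid in tagged if not matches(items.get(qid, {}), criteria)}


def same_kind_subtree(root: str, subcats: Dict[str, List[str]],
                      criteria_of: Dict[str, Criteria]) -> List[str]:
    """Descend from root, pruning branches whose P31 criterion differs from root's."""
    def kinds(cat: str) -> Set[str]:
        return set().union(*(alt.get("P31", set()) for alt in criteria_of.get(cat, [])))

    root_kinds = kinds(root)
    kept, stack, seen = [], [root], set()
    while stack:
        cat = stack.pop()
        if cat in seen or not (kinds(cat) & root_kinds):
            continue
        seen.add(cat)
        kept.append(cat)
        stack.extend(subcats.get(cat, []))
    return kept


# Hypothetical data: placeholder ids, not real Wikidata items.
women_engineers: Criteria = [
    {"P31": {"Q_human"}, "P21": {"Q_female"}, "P106": {"Q_engineer"}}
]
items = {
    "item_a": {"P31": {"Q_human"}, "P21": {"Q_female"}, "P106": {"Q_engineer"}},
    "item_b": {"P31": {"Q_human"}, "P21": {"Q_female"}, "P106": {"Q_engineer", "Q_writer"}},
    "item_c": {"P21": {"Q_female"}, "P106": {"Q_engineer"}},  # P31 missing
}
tagged = {"item_a", "item_c"}  # pages carrying the category line in wikitext

criteria_of = {
    "Battles": [{"P31": {"Q_battle"}}],
    "Battles of war W": [{"P31": {"Q_battle"}, "P_part_of": {"Q_war_w"}}],
    "Battle of Y": [{"P31": {"Q_human"}, "P_participant_in": {"Q_battle_y"}}],
}
subcats = {"Battles": ["Battles of war W"], "Battles of war W": ["Battle of Y"]}

print(sorted(augmented_view(tagged, items, women_engineers)))
print(sorted(violations(tagged, items, women_engineers)))
print(same_kind_subtree("Battles", subcats, criteria_of))
```

In this toy data, `item_b` is picked up by the augmented view although it was never tagged, `item_c` is reported because its P31 is missing, and the crawl from "Battles" stops before "Battle of Y", whose members are no longer battles.
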
Such constraint violations might efficiently indicate either a missing property on the item, e.g. a missing P31 (still a big problem for us); or an additional set of acceptable criteria; or an incorrect sitelink on the item identifying the category in that particular language; or an item that should not be in the category. Machine interpretability would allow these mismatches to be highlighted straightforwardly.

I should at this point note that P360 is currently labelled "is a *list* of". However, GerardM has already filled it in for 2000 categories, from a recent start, and discussions at Project chat <https://www.wikidata.org/wiki/Wikidata:Project_chat#.22is_a_list_of.22_on_categories> and this PfD <https://www.wikidata.org/wiki/Wikidata:Properties_for_deletion#.7B.7BPfD.7CProperty:P971.7D.7D> appear to agree that this is appropriate, and a far better model for taking this further than the vague P971 "category combines topics".

There are a few more wrinkles about what does and does not tend to get included in categories that are worth thinking about, to make the system above work.

a) Sometimes a category contains key articles that do not match the inclusion criteria. For example, if there is a survey article directly on the topic of the category, that will usually be included, typically at the top of the category listing, e.g. under the alphabetisation "*", separated off from the regular articles. So a category "20th-century painters" might include a category lead article "Painters of the 20th century" -- even though that is not an article about a painter. Fortunately we already have a property to identify such a lead article for a category: P301, "category's main topic". However, there may be other such articles -- e.g. "List of 20th century painters" -- that are typically included in such a category.
If these were indicated as values of a new property "Category auxiliary article", then they could be included appropriately in an automatically generated category view, despite not meeting the main criteria, and excluded from constraint-violation reports.

b) One other thing to recall is that categories typically do not directly include objects that are included in sub-categories. So one other new property is also needed, "category is a sub-category of", to record that the current category A is a sub-category of category X. Then in any auto-generated (or auto-augmented) view of the parent category X, all sub-categories with this property can be identified, and all items satisfying the inclusion criteria of any of the sub-categories can be excluded from the generated category view. (This is a slightly involved search, but shouldn't be beyond whatever query engine ultimately gets specified.)

One further wrinkle is that different languages have different category structures -- so, as a qualifier on the property, one would also need to record in which languages category A is a sub-category of category X -- in other languages it might be a sub-category of something else. But that is fine. This I think bothered you, Lydia, when you noted that categories are "completely different across languages/cultures" -- but what is recorded on Wikidata does not have to be *normative*. It's not telling every language how things ought to be subcategorised, the one true revealed way; rather, the subcategory property would be *descriptive*, recording how things actually have been categorised in all the various different languages, without any judgement at all. In this way the view of the contents of a particular category would be wiki-specific (just as it is now). Each language would be different, and Commons different again. But that's not a problem.

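
The sub-category exclusion in (b), with its per-language qualifier, could be sketched along the following lines. This is only an illustration under assumptions: the proposed "category is a sub-category of" property does not exist yet, so it is modelled here as a hypothetical `parent_links` mapping from a category to (parent, set of wikis) pairs, and `P_century` and all `Q_...` values, category keys and item names are placeholders rather than real identifiers.

```python
from typing import Dict, List, Set, Tuple

Statements = Dict[str, Set[str]]  # property id -> set of value ids
ParentLinks = Dict[str, List[Tuple[str, Set[str]]]]  # child -> [(parent, wikis)]


def matches(item: Statements, criteria: Statements) -> bool:
    """True if the item has at least one wanted value for every listed property."""
    return all(item.get(prop, set()) & wanted for prop, wanted in criteria.items())


def direct_members(category: str, wiki: str, items: Dict[str, Statements],
                   criteria_of: Dict[str, Statements],
                   parent_links: ParentLinks) -> Set[str]:
    """Items matching the category but none of its sub-categories on this wiki."""
    # Sub-categories of `category` according to this particular wiki only.
    subcats = [
        child
        for child, links in parent_links.items()
        if any(parent == category and wiki in wikis for parent, wikis in links)
    ]
    return {
        qid
        for qid, st in items.items()
        if matches(st, criteria_of[category])
        and not any(matches(st, criteria_of[sub]) for sub in subcats)
    }


# Hypothetical data: placeholder ids, not real Wikidata items.
base = {"P31": {"Q_human"}, "P106": {"Q_painter"}, "P_century": {"Q_20c"}}
criteria_of = {
    "painters_20c": base,
    "french_painters_20c": {**base, "P27": {"Q_france"}},
}
# Descriptive, not normative: the same category has different parents per wiki.
parent_links: ParentLinks = {
    "french_painters_20c": [
        ("painters_20c", {"enwiki"}),
        ("french_painters", {"frwiki"}),
    ],
}
items = {
    "painter_x": {**base, "P27": {"Q_france"}},
    "painter_y": {**base, "P27": {"Q_spain"}},
}

print(sorted(direct_members("painters_20c", "enwiki", items, criteria_of, parent_links)))
print(sorted(direct_members("painters_20c", "frwiki", items, criteria_of, parent_links)))
```

With this data the generated view for "painters_20c" differs by wiki: on the wiki where the French sub-category sits underneath it, `painter_x` is pushed down into the sub-category, while on the other wiki both painters are listed directly.
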
And having the union of all those patterns stored on Wikidata (or rather, the information behind the different categorisation structures) would actually be a boon to researchers, and to crawlers, which could then crawl the category tree in all languages at once -- one more example of the internationalisation achieved by Wikidata.

So that's how I think categories should go forward. What would that translate into, in practical terms?

- Filling out the P360s probably needs to be done by hand -- but GerardM is already showing what can be achieved.
- Adding the information as to what is a sub-category of what probably should be a big central batch update -- with ongoing mechanisms to keep it synchronised with categorisation changes on the different wikis in the different languages. But there are real advantages to the sub-category relation appearing to be "just another property", and queryable as such, so that people can include it in a query like anything else, with the same syntax as for anything else -- even if in reality it were a "virtual property", generated on the fly from the main SQL tables.
- Constraint-violation reports -- should not be a problem to generate, even right away, as soon as P360s are in place; though they might need some scripting, to pull together Wikidata properties with wiki data that is currently sitting in SQL tables. Can GerardM already do this?
- Auto-augmentation of category views -- needs a bit of thought, but shouldn't be prohibitive, even soon?
- Other category goodies -- looking again at how categories are generated and presented, and going to a more dynamic model, might allow other goodies to be incorporated reasonably easily -- e.g. alternative possible sort orders for the results (with the sequence of item properties to sort on specified for each one e.g. as the value of a property of the category item; or by that category-item property holding a reference to a standard order, defined on another item, e.g. Property:Sort_option = Q{Sort category by date}).
- One could also imagine adding filtering options to a category view -- e.g. to show only the best image, if different images all have the same statements on them (suppress alternates/duplicates).

So I think there's a lot that Wikidata could bring to category views; but the start is to understand what is already there, by getting P360s in place.

TASK DETAIL
  https://phabricator.wikimedia.org/T87686
