[Wikidata-bugs] [Maniphest] [Updated] T87686: Categories are metadata

Jheald Thu, 26 Feb 2015 14:30:10 -0800

Jheald added a subscriber: Jheald.
Jheald added a comment.

I've bent Lydia's ear a couple of times in the past on this, once in Amsterdam 
and once in Paris on the way back from the boat reception.


Categories have developed for a reason.  There's a real value in having 
groupings of content (whether articles or Commons images) into groups of a 
human-manageable size of say 20 to 200 items, with the groups arranged in a 
curated hierarchical structure.  IMO the category view may be particularly 
valuable for images, where there is real value being able to scroll down a 
group of about that size (so a degree of specificity giving a group of about 
that size) of images on a particular topic.  But it also goes for articles too, 
to find related material: there is value in being able to see together a group 
of a particular size of possible related content.  Too big a group (too little 
specificity) and it's overwhelming; but too small a group (too much 
specificity) and it becomes too 'bitty', and you don't see enough options to 
find the article you want or get an idea of the level of context and coverage 
all in one place, without it being broken into endless little bits.  So there 
is value in the category system, which has tended to refine the degree of 
specificity to a particular sweet spot, that is an appropriate sized group -- 
not too big, not too little -- to be of most value to a human reader.

There's also great value in the knowledge that is stored in that hierarchical 
structure -- you won't go to a conference with researchers who have used 
Wikipedia without there being at least someone who has mined our category 
system for groups and relationships.   And we're using it ourselves for 
Wikidata, as one of the key sources people mine to systematically give 
P31 ("instance of") values and key properties to items, far 
far too many of which are still currently not specified.

But the category system currently has great weaknesses too.  Firstly, because 
addition is manual, inclusion and coverage can be haphazard; and because the 
structure is organic and somewhat arbitrary, even finding the correct category 
to put something in can be unpredictable, time-consuming and onerous.  Second, 
from an information-mining perspective, there is a difficulty because of a lack 
of transitiveness: if A is a member of B, and B is a member of C, it does not 
necessarily follow that A matches the inclusion criteria of C.   As a result, a 
downward exploration of the hierarchy doesn't have to go very far to find 
category contents wheeling off in all manner of strange and unexpected 
directions, quite incompatible with what was originally sought to be harvested.

These are general problems of the category system, as applicable to Wikipedia 
as to Commons.  So perhaps I should not be adding this to a Commons-specific 
Phabricator item.  I tend to agree with Lydia that structured items for Commons 
categories should mostly stay on Commons, attached to a particular Commons 
page, or however the Commons wikibase is structured, rather than the main 
Wikidata.  But I think the issues are the same for both -- IMO the Commons and 
Wikipedia category issues exactly parallel each other --  and also the value of 
what Wikidata (or structured data) can bring, to preserve category-like views, 
but to make them actually work much better -- both for humans and for machines.

Since they are so parallel, in what follows I'll discuss in terms of items on 
Wikidata and corresponding articles and categories on Wikipedias, but the 
translation to Commons should be straightforward.

So here goes.

If we're going to make categories work better, the first thing to do is to work 
out what is in them at the moment, and document it.

It turns out that we actually already have the Property to do it: Property 360 
"is a list of", as for example demonstrated in action on Q15832361,
List of women engineers <https://www.wikidata.org/wiki/Q15832361>

The syntax for P360 exactly mirrors the properties that items to be included in 
the list or category should have, as inclusion criteria -- and in particular 
highlights the P31 ("instance of") value, which defines what sort of 
fundamental object they are.

Filling out a P360 for each category solves the problem of transitivity -- 
because with the explicit inclusion criteria, it is easy for a crawler to 
identify eg when the downward sequence of categories ceases to be ever more 
refined groups with eg P31=battle, and instead turns into a category about a 
specific battle, with its commanders etc -- where the battle in question has 
become a property of the included items, rather than their fundamental P31 
being a battle.
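
A rough sketch of that crawl, in Python, might look like the following. The category names, the property labels other than P31, and the in-memory dictionaries are all made up for illustration; this is not real Wikidata data or any existing API. The idea is only that a crawler keeps descending while each sub-category's criteria still repeat everything its parent required, and stops where they do not.

```python
# Hypothetical toy data: category names, property labels and values are made up.
# CRITERIA plays the role of each category's P360-style inclusion criteria.
CRITERIA = {
    "Battles": {"P31": "battle"},
    "Battles of the Napoleonic Wars": {"P31": "battle", "conflict": "Napoleonic Wars"},
    "Battle of Waterloo": {"P31": "human", "took part in": "Battle of Waterloo"},
}
# SUBCATS records which categories sit directly beneath each category.
SUBCATS = {
    "Battles": ["Battles of the Napoleonic Wars"],
    "Battles of the Napoleonic Wars": ["Battle of Waterloo"],
    "Battle of Waterloo": [],
}


def refines(child, parent):
    """True if the child criteria repeat every property/value pair of the parent."""
    return all(child.get(prop) == value for prop, value in parent.items())


def crawl(root):
    """Walk down from root, following only sub-categories that refine their parent."""
    kept, stack, seen = [], [root], set()
    while stack:
        cat = stack.pop()
        if cat in seen:
            continue
        seen.add(cat)
        kept.append(cat)
        for sub in SUBCATS.get(cat, []):
            if refines(CRITERIA.get(sub, {}), CRITERIA.get(cat, {})):
                stack.append(sub)
    return kept


print(crawl("Battles"))
```

Here the crawl keeps the two categories whose members are battles and declines to descend into the category about one specific battle, whose members are people.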

Having P360 in place also means that (if desired) the category-view could be 
auto-augmented, with objects whose items match the criteria, even if they 
haven't got the category line in their wikitext.  (Magnus's Reasonator already 
does this, to predict what items ought to be in list articles).  Allowing users 
to turn this on, on a per-category basis, would hugely help the problem of 
comprehensiveness.  Category views would then be as comprehensive as the 
wikidata (much easier to systematically interrogate, and compute intersections 
for), without all the fiddling business of having to get category names by hand.
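
To make the auto-augmentation idea concrete, here is a minimal sketch, assuming item statements and category criteria are available as plain dictionaries. The item ids, property labels and the `augmented_view` function are hypothetical names invented for this example, not part of Wikidata or MediaWiki.

```python
# Hypothetical toy data: item ids, property labels and values are placeholders.
ITEMS = {
    "item_a": {"P31": "human", "occupation": "engineer", "gender": "female"},
    "item_b": {"P31": "human", "occupation": "engineer", "gender": "female"},
    "item_c": {"P31": "human", "occupation": "painter", "gender": "female"},
}
# P360-style inclusion criteria for a category such as "Women engineers".
CRITERIA = {"P31": "human", "occupation": "engineer", "gender": "female"}


def matches(statements, criteria):
    """True if the item's statements satisfy every inclusion criterion."""
    return all(statements.get(prop) == value for prop, value in criteria.items())


def augmented_view(explicit_members, criteria, items):
    """Union of hand-categorised members and items that match the criteria."""
    matched = {item for item, statements in items.items() if matches(statements, criteria)}
    return sorted(set(explicit_members) | matched)


# Only item_a carries the category line in its wikitext; item_b is picked up
# purely because its statements match the criteria.
print(augmented_view(["item_a"], CRITERIA, ITEMS))
```

Anything categorised by hand stays in the view whether or not it matches, so turning augmentation on can only add entries, never drop them.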

Also, as the direct converse, one could organise a constraint violation report for all 
items included in a category (on a particular wiki) that apparently did *not* 
match the inclusion criteria defined in its P360 (or P360s -- there might be 
alternative sets of criteria acceptable).  Such constraint violations might 
efficiently indicate either a missing property on the item, eg a missing 
P31 (still a big problem for us); or an 
additional set of acceptable criteria; or an incorrect item sitelink to 
identify the category from that particular language; or an item that should not 
be in the category -- the machine interpretability would allow these mismatches 
to be highlighted straightforwardly.
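
A constraint-violation check along these lines could be sketched as below. It assumes the category membership and the item statements have already been pulled together into Python structures; the data, the property labels and the `violations` function are hypothetical. For each member that satisfies none of the alternative criteria sets, it reports which properties fall short of the closest set, which is the hint a human would need to decide between "missing statement" and "wrong category".

```python
# Hypothetical toy data: item ids, property labels and values are placeholders.
ITEMS = {
    "item_a": {"P31": "human", "occupation": "engineer", "gender": "female"},
    "item_d": {"occupation": "engineer", "gender": "female"},  # P31 is missing
    "item_e": {"P31": "human", "occupation": "engineer", "gender": "male"},
}
# A category may have several acceptable P360-style criteria sets.
CRITERIA_SETS = [{"P31": "human", "occupation": "engineer", "gender": "female"}]
# Items currently placed in the category on some wiki.
MEMBERS = ["item_a", "item_d", "item_e"]


def unmet(statements, criteria):
    """Properties whose required value is absent or different on the item."""
    return sorted(prop for prop, value in criteria.items() if statements.get(prop) != value)


def violations(members, criteria_sets, items):
    """Map each non-matching member to its unmet properties for the closest criteria set."""
    report = {}
    for member in members:
        gaps = [unmet(items.get(member, {}), criteria) for criteria in criteria_sets]
        best = min(gaps, key=len, default=[])
        if best:
            report[member] = best
    return report


print(violations(MEMBERS, CRITERIA_SETS, ITEMS))
```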

I should at this point note that P360 is currently labelled "is a *list* of".  
However, GerardM has now already filled it in for 2000 categories, from a 
recent start, and discussions at
Project chat 
<https://www.wikidata.org/wiki/Wikidata:Project_chat#.22is_a_list_of.22_on_categories>
 and this PfD 
<https://www.wikidata.org/wiki/Wikidata:Properties_for_deletion#.7B.7BPfD.7CProperty:P971.7D.7D>
appear to agree that this is appropriate, and a far better model for taking this 
further than the vague P971 "category combines topics".

There are a few more wrinkles about what does and does not tend to get included 
in categories, that are worth thinking about, to make the system above work.

a) Sometimes a category contains key articles that do not match the inclusion 
criteria.

For example, if there is a survey article directly on the topic of the 
category, that will usually be included at the top of the category listing, eg 
under the alphabetisation "*", separated off from the regular articles.  So a 
category "20th-century painters" might include a category lead article 
"Painters of the 20th century" -- even though that is not an article about a 
painter.

Fortunately we already have a property to identify such a lead article for a 
category -- P301, "category's main topic".

However, there may be other such articles -- eg "List of 20th century painters" 
-- that are typically included in such a category.

If these were indicated as values of a new property "Category auxiliary 
article", then they could be included appropriately in an automatically 
generated category view, despite not meeting the main criteria, or excluded 
from constraint violation reports.

b) One other thing to recall is that categories typically do not directly 
include objects that are included in sub-categories.

So one other new property is also needed, a property "category is a 
sub-category of", to record that the current category A is a sub-category of 
Category X; so that in any auto-generated (or auto-augmented) view of the parent 
category X, all sub-categories with this property can be identified, and all 
items satisfying the inclusion criteria of any of the sub-categories can be 
excluded from the generated category view.  (Which is a slightly involved 
search, but shouldn't be beyond whatever query engine ultimately gets 
specified).
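
The "slightly involved search" might be sketched roughly as follows, assuming the sub-category relation and the inclusion criteria are both queryable. The category names, item ids, property labels and the `direct_view` function are all hypothetical. An item appears in the parent's generated view only if it matches the parent's criteria and does not match the criteria of any direct sub-category.

```python
# Hypothetical toy data: names, item ids, property labels and values are placeholders.
PAINTER_20C = {"P31": "human", "occupation": "painter", "century": "20th"}
CRITERIA = {
    "20th-century painters": PAINTER_20C,
    "20th-century French painters": {**PAINTER_20C, "nationality": "French"},
}
# Stand-in for the proposed "category is a sub-category of" property.
SUBCATEGORY_OF = {"20th-century French painters": "20th-century painters"}
ITEMS = {
    "item_p": {**PAINTER_20C, "nationality": "Dutch"},
    "item_q": {**PAINTER_20C, "nationality": "French"},
}


def matches(statements, criteria):
    """True if the item's statements satisfy every inclusion criterion."""
    return all(statements.get(prop) == value for prop, value in criteria.items())


def direct_view(parent, criteria, subcategory_of, items):
    """Items matching the parent's criteria but none of its direct sub-categories."""
    subs = [child for child, par in subcategory_of.items() if par == parent]
    return sorted(
        item
        for item, statements in items.items()
        if matches(statements, criteria[parent])
        and not any(matches(statements, criteria[sub]) for sub in subs)
    )


print(direct_view("20th-century painters", CRITERIA, SUBCATEGORY_OF, ITEMS))
```

Checking only direct sub-categories is enough in this sketch as long as deeper sub-categories merely refine the criteria of the ones above them.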

One further wrinkle is that different languages have different category 
structures -- so, as a qualifier on the property, one would also need to record 
in which languages Category A was a sub-category of Category X -- in other 
languages it might be a sub-category of something else.  But that is fine.  
This I think bothered you Lydia, when you noted that categories are "completely 
different across languages/cultures" -- but what is recorded on Wikidata does 
not have to be *normative* -- it's not telling every language how things ought 
to be subcategorised, the one true revealed way -- rather, the subcategory 
property would be *descriptive* -- recording how things actually have been 
categorised in all the various different languages, without any judgement at 
all.
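
As an illustration of how descriptive, per-language sub-category statements could be stored and queried, here is a small sketch. The tuple layout, the category names and the `parents_of` function are assumptions made up for this example; the real data model would be whatever Wikidata qualifiers end up providing.

```python
# Hypothetical statements: (sub-category, parent category, wikis where this holds).
# The set of language codes stands in for a qualifier on the sub-category property.
SUBCATEGORY_STATEMENTS = [
    ("Category A", "Category X", {"en", "fr"}),
    ("Category A", "Category Y", {"de"}),
]


def parents_of(category, language=None):
    """Parents of a category on one wiki, or across all wikis if language is None."""
    return sorted(
        {
            parent
            for child, parent, languages in SUBCATEGORY_STATEMENTS
            if child == category and (language is None or language in languages)
        }
    )


print(parents_of("Category A", "en"))  # what the English wiki does
print(parents_of("Category A", "de"))  # what the German wiki does
print(parents_of("Category A"))        # the union across languages
```

Nothing in the stored statements says which arrangement is right; a query for one language simply reports what that wiki has done.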

In this way the view of the contents of a particular category would be 
wiki-specific (just as it is now).  Each language would be different, and 
Commons different again.  But that's not a problem.  And having the union of 
all those patterns stored on Wikidata (or rather, the information behind the 
different categorisation structures) would actually be a boon to researchers, 
and to crawlers, which could then actually crawl the category tree in all 
languages at once -- one more example of the internationalisation achieved by 
Wikidata.

So that's how I think categories should go forward.

What would that translate into, in practical terms?

- Filling out the P360s probably needs to be done by hand -- but GerardM is 
already showing what can be achieved.

- Adding the information as to what is a sub-category of what probably should 
be a big central batch update -- with ongoing mechanisms to keep it 
synchronised with categorisation changes on the different wikis in the 
different languages.  But there are real advantages to the sub-category 
appearing to be "just another property", and queryable as such, so that people 
can include it in a query like anything else, with the same syntax as for anything 
else -- even if in reality it was actually a "virtual property", being 
generated on the fly from the main SQL tables.

- Constraint-violation reports -- should not be a problem to generate, even 
right away, as soon as  P360s are in place; though they might need some 
scripting, to pull together Wikidata properties with wiki data that is 
currently sitting in SQL tables.   Can GerardM already do this?

- Auto-augmentation of category views -- needs a bit of thought, but shouldn't 
be prohibitive, even soon?

- Other category goodies -- looking again at how categories are generated and 
presented, going to a more dynamic model, might allow other goodies to be 
reasonably easily incorporated -- eg alternative possible sort orders for the 
results (the sequence of item-properties for each one specified e.g. as the 
value of a property of a category item; or by that category-item-property 
holding a reference to a standard order, defined on another item, eg 
Property:Sort_option = Q{Sort category by date}).

- One could also imagine adding filtering options to a category view -- eg to 
only show the best image, if different images all have the same statements on 
them (suppress alternates/duplicates).

So I think there's a lot that Wikidata could bring to category views; but the 
start is to understand what is already there, by getting P360s in place.


TASK DETAIL
  https://phabricator.wikimedia.org/T87686




