Re: [Wikitech-l] category intersection conversations

Brian Wolff Wed, 08 May 2013 21:14:05 -0700

On 2013-05-08 11:48 PM, "James Forrester" <jforres...@wikimedia.org> wrote:
>
> On 8 May 2013 18:26, Sumana Harihareswara <suma...@wikimedia.org> wrote:
>
> > Recently a lot of people have been talking about what's possible and
> > what's necessary regarding MediaWiki, CatScan-like tools, and real
> > category intersection; this mail has some pointers.
> >
> > The long-term solution is a sparkly query for, e.g., people with aspects
> > novelist + Singaporean, and it would be great if Wikidata could be the
> > data-source.  Generally people don't really want to search using
> > hierarchical categories; they want tags and they want AND. But
> > MediaWiki's current power users do use hierarchical labels, so any
> > change would have to deal with current users' expectations.  Also my
> > head hurts just thinking of the "but my intuitively obvious ontology is
> > better than yours" arguments.
> >
>
> To put a nice clear stake in the ground, a magic-world-of-loveliness
> sparkly proposal for 2015* might be:


Just to clarify, you mean sparkles in the way that a unicorn sparkles as
it's hopping over a rainbow, not sparkle as in SPARQL (semantic triple store
based)?

>
> * Categories are implemented in Wikidata
> * -> They're in whatever language the user wants (so fr:Chat and en:Cat and
> nl:kat and zh-han-t:貓 …)

Issue (probably can be dealt with somehow, or maybe rare enough not to
care): conflicts. What if the name of one category in French is the same as
the name of a different category in Spanish? This may be a non-issue if done
using Wikidata numeric IDs.
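
To make the numeric-ID point concrete, here is a minimal sketch in Python. The IDs and labels below are invented for illustration and are not real Wikidata data. Labels are looked up per language, so the same string can point to different categories in different languages without colliding.

```python
# Hypothetical data: IDs and labels are made up for this sketch.
LABELS = {
    1: {"en": "Cat", "fr": "Chat", "nl": "kat"},
    2: {"en": "Chat", "fr": "Clavardage"},
}


def build_index(labels):
    """Map (language, case-folded label) to the category's numeric ID."""
    index = {}
    for cat_id, by_lang in labels.items():
        for lang, label in by_lang.items():
            index[(lang, label.casefold())] = cat_id
    return index


def resolve(index, lang, label):
    """Return the numeric ID for a label in one language, or None."""
    return index.get((lang, label.casefold()))


INDEX = build_index(LABELS)
french_chat = resolve(INDEX, "fr", "Chat")   # the animal
english_chat = resolve(INDEX, "en", "Chat")  # the conversation
```

A clash only remains when two categories share a label within a single language, which would need a separate disambiguation step.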

> * -> They're properly queryable

Different groups have different definitions of this.

> * -> They're shared between wikis (pooled expertise)

Between Wikipedias, or all Wikimedia wikis? Category structure has varied
meaning between projects. Category:North_America has different types of
pages in enwikinews compared to enwikipedia.
>
> * Pages are implicitly in the parent categories of their explicit
> categories
> * -> Pages in <Politicians from the Netherlands> are in <People from the
> Netherlands by profession> (its first parent) and <People from the
> Netherlands> (its first parent's parent) and <Politicians> (its second
> parent) and <People> (its second parent's parent) and …
> * -> Yes, this poses issues given the sometimes cyclic nature of
> categories' hierarchies, but this is relatively trivial to code around

In the current structure, it doesn't make sense for Bob to be in a list of
people by profession. It makes less sense the further you traverse the
category graph. OTOH, better querying capabilities may turn the category
system into more of a flat namespace, making that less of an issue.
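
As a rough sketch of the implicit-membership idea (not how MediaWiki stores categories), the Python below walks the parent graph breadth-first and records how far each implied category is from an explicit one. The `parents` mapping and the `max_depth` cut-off are assumptions made for illustration, and the last entry is a deliberately artificial cycle.

```python
from collections import deque

# Hypothetical parent graph, loosely following the example in the quoted mail.
PARENTS = {
    "Politicians from the Netherlands": [
        "People from the Netherlands by profession",
        "Politicians",
    ],
    "People from the Netherlands by profession": ["People from the Netherlands"],
    "Politicians": ["People"],
    # Artificial cycle, only to show that the traversal terminates.
    "People from the Netherlands": ["People from the Netherlands by profession"],
}


def implicit_categories(explicit, parents, max_depth=None):
    """Return {category: distance} for all categories implied by `explicit`."""
    distance = {cat: 0 for cat in explicit}
    queue = deque(explicit)
    while queue:
        cat = queue.popleft()
        if max_depth is not None and distance[cat] >= max_depth:
            continue  # do not climb any further from this category
        for parent in parents.get(cat, ()):
            if parent not in distance:  # the visited check also breaks cycles
                distance[parent] = distance[cat] + 1
                queue.append(parent)
    return distance


everything = implicit_categories(["Politicians from the Netherlands"], PARENTS)
nearby = implicit_categories(["Politicians from the Netherlands"], PARENTS, max_depth=1)
```

Keeping the distance around would let a search down-weight or drop far-away ancestors, which is one possible answer to the "makes less sense the further you traverse" concern.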

>
> * Readers can search, querying across categories regardless of whether
> they're implicit or explicit
> * -> A search for the intersection of <People from the Netherlands> with
> <Politicians> will effectively return results for <Politicians from the
> Netherlands> (and the user doesn't need to know or care that this is an
> extant or non-extant category)

We would need some system to turn fake categories into real queries. I suppose
users could make redirects. The alternative of magic NLP sounds difficult.

> * -> Searches might be more than just intersections, e.g. "<Painters from
> the United Kingdom> AND <Living people> NOT <Members of the Royal Academy>"
> or whatever.
> * -> Such queries might be cached (and, indeed, the intersections that
> people search for might be used to suggest new categorisation schemata that
> wikis had previously not considered - e.g. <British politicians> & <People
> with pet cats> & <People who died in hot-ballooning accidents>)

Dealing with cache invalidation (unless it is quite coarse grained) may be
difficult.
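
For illustration only, here is a small Python sketch of AND/NOT queries with the coarsest possible invalidation: a single version counter that any write bumps, so every cached result becomes stale at once. The class name, page titles, and in-memory storage are all hypothetical, and nothing here reflects an existing MediaWiki component.

```python
class CategoryIndex:
    """In-memory category membership with a coarsely invalidated query cache."""

    def __init__(self):
        self.members = {}  # category name -> set of page titles
        self.version = 0   # bumped on every write
        self._cache = {}   # (include, exclude) -> (version, result)

    def add(self, page, category):
        self.members.setdefault(category, set()).add(page)
        self.version += 1  # coarse: invalidates every cached query

    def query(self, include, exclude=()):
        """Pages in all `include` categories and in none of the `exclude` ones."""
        key = (frozenset(include), frozenset(exclude))
        hit = self._cache.get(key)
        if hit is not None and hit[0] == self.version:
            return hit[1]
        sets = [self.members.get(cat, set()) for cat in key[0]]
        result = set.intersection(*sets) if sets else set()
        for cat in key[1]:
            result = result - self.members.get(cat, set())
        result = frozenset(result)
        self._cache[key] = (self.version, result)
        return result


index = CategoryIndex()
index.add("Painter A", "Painters from the United Kingdom")
index.add("Painter A", "Living people")
index.add("Painter B", "Painters from the United Kingdom")
index.add("Painter B", "Living people")
index.add("Painter B", "Members of the Royal Academy")

wanted = ["Painters from the United Kingdom", "Living people"]
before = index.query(wanted, exclude=["Members of the Royal Academy"])
index.add("Painter A", "Members of the Royal Academy")  # stale cache entry
after = index.query(wanted, exclude=["Members of the Royal Academy"])
```

The trade-off is the one mentioned above: this is trivially correct but throws away all cached work on every edit, while anything finer grained has to track which queries depend on which categories.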
>
> * Editors can tag articles with leaf or branch categories, potentially
> over-lapping and the system will rationalise the categories on save to the
> minimally-spanning subset (or whatever is most useful for users, the
> database, and/or both)

That's quite an interesting idea, and one I haven't heard before from
previous times this has been brought up.

One concern I'd have is how to figure out which categories to list at the
bottom of the page (all that could fit, or only the base categories, and
how to determine what those are).
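
One possible reading of "minimally-spanning subset", sketched in Python under the assumption that a tagged category is redundant whenever it is an ancestor of another tagged category. The `parents` mapping is hypothetical and the quoted proposal does not specify an algorithm.

```python
# Hypothetical parent graph for illustration.
PARENTS = {
    "Politicians from the Netherlands": [
        "People from the Netherlands by profession",
        "Politicians",
    ],
    "People from the Netherlands by profession": ["People from the Netherlands"],
    "Politicians": ["People"],
}


def ancestors(category, parents):
    """All categories reachable by following parent links (cycle-safe)."""
    seen = set()
    stack = list(parents.get(category, ()))
    while stack:
        cat = stack.pop()
        if cat not in seen:
            seen.add(cat)
            stack.extend(parents.get(cat, ()))
    return seen


def minimal_categories(tagged, parents):
    """Drop every tagged category that another tagged category already implies."""
    tagged = set(tagged)
    implied = set()
    for cat in tagged:
        implied |= ancestors(cat, parents) - {cat}
    return tagged - implied


saved = minimal_categories(
    [
        "Politicians from the Netherlands",
        "People from the Netherlands",
        "Politicians",
        "Living people",
    ],
    PARENTS,
)
```

Note that if two tagged categories sit on the same cycle, this sketch treats each as implied by the other and drops both, so a real implementation would need a tie-break rule. It also says nothing about which categories to display, which is the open question above.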

> * -> Editors don't need to know the hierarchy of categories *a priori* when
> adding pages to them (yay, less difficulty)
> * -> Power editors don't need to type in loads of different categories if
> they have a very specific one in mind (yay, still flexible)
> * -> Categories shown to readers aren't necessarily the categories saved in
> the database, at editorial judgement (otherwise, would a page not be in
> just a single category, namely the intersection of all its tagged
> categories?)
>
> Apart from the time and resources needed to make this happen and
> operational, does this sound like something we'd want to do? It feels like
> this, or something like it, would serve our editors and readers the best
> from their perspective, if not our sysadmins. :-)
>
> [Snip]
> 
>
> > I think the best place to pursue this topic is probably in
> > https://meta.wikimedia.org/wiki/Talk:Beyond_categories .  It's unlikely
> > Wikimedia Foundation will be able to make engineers available to work on
> > this anytime soon, but I would not be surprised if the Wikidata
> > developer community or volunteers found this interesting enough to work
> > on.
>
>
> I guess I should post this there too, maybe once someone's told me if it's
> mad-cap. ;-)
>

I think you have captured what a lot of people want in a somewhat dreamy
sense. However, there is still a lot to do to make that vision concrete. In
particular, I think there would be non-trivial UI challenges in making this
understandable to the user.

----

From what I hear, Wikidata phase 3 is basically going to be support for
inline queries. Details are vague, but if they support the typical types of
queries you associate with semantic networks, there is category
intersection right there.
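
To show why semantic-network style queries would give intersection for free, here is a toy Python sketch that treats statements as (subject, predicate, object) triples and intersects the subjects matching each pattern. The property names, the people, and the `match` helper are all invented. This is neither SPARQL nor anything Wikidata has announced.

```python
# Hypothetical statements; names and properties are made up.
TRIPLES = [
    ("Person A", "occupation", "novelist"),
    ("Person A", "nationality", "Singaporean"),
    ("Person B", "occupation", "novelist"),
    ("Person B", "nationality", "Dutch"),
    ("Person C", "nationality", "Singaporean"),
]


def match(triples, patterns):
    """Subjects that have every (predicate, object) pair in `patterns`."""
    by_pair = {}
    for subject, predicate, obj in triples:
        by_pair.setdefault((predicate, obj), set()).add(subject)
    results = None
    for pair in patterns:
        subjects = by_pair.get(pair, set())
        results = subjects if results is None else results & subjects
    return results or set()


singaporean_novelists = match(
    TRIPLES,
    [("occupation", "novelist"), ("nationality", "Singaporean")],
)
```

The "novelist + Singaporean" example from the start of the thread then needs no category called "Singaporean novelists" at all. Whether this scales is exactly the open performance question.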

If any of the Wikidata folk could comment on what sort of queries are
planned for phase 3, performance/scaling considerations, and technologies
being considered (triple store?), I'd be very interested in hearing. (I
recognize that future plans may not exist yet.)

More generally, it would be interesting to know the performance
characteristics of SPARQL-type query systems, since people seem to be
talking about them. Are they a non-starter, or could they be feasible?
"Semantic" and "efficient" are not words I associate with each other, but
that is due to rumour, not actual data. (Although my brief googling doesn't
exactly look promising.)

-bawolff
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
