On 2013-05-08 11:48 PM, "James Forrester" <jforres...@wikimedia.org> wrote:
>
> On 8 May 2013 18:26, Sumana Harihareswara <suma...@wikimedia.org> wrote:
>
> > Recently a lot of people have been talking about what's possible and
> > what's necessary regarding MediaWiki, CatScan-like tools, and real
> > category intersection; this mail has some pointers.
> >
> > The long-term solution is a sparkly query for, e.g., people with aspects
> > novelist + Singaporean, and it would be great if Wikidata could be the
> > data-source. Generally people don't really want to search using
> > hierarchical categories; they want tags and they want AND. But
> > MediaWiki's current power users do use hierarchical labels, so any
> > change would have to deal with current users' expectations. Also my
> > head hurts just thinking of the "but my intuitively obvious ontology is
> > better than yours" arguments.
> >
>
> To put a nice clear stake in the ground, a magic-world-of-loveliness
> sparkly proposal for 2015* might be:

Just to clarify, you mean sparkles in the way that a unicorn sparkles as
it's hopping over a rainbow, not sparkle as in SPARQL (semantic triple
store based)?

>
> * Categories are implemented in Wikidata
> * -> They're in whatever language the user wants (so fr:Chat and en:Cat and
> nl:kat and zh-han-t:貓 …)

Issue (probably can be dealt with somehow, or maybe rare enough not to
care): conflicts. What if the name of one category in French is the same
as that of a different category in Spanish? This may be a non-issue if
done using Wikidata numeric IDs.

> * -> They're properly queryable

Various groups have various definitions of this.

> * -> They're shared between wikis (pooled expertise)

Between Wikipedias, or all Wikimedia wikis? Category structure has varied
meaning between projects. Category:North_America has different types of
pages in enwikinews compared to enwikipedia.

>
> * Pages are implicitly in the parent categories of their explicit categories
> * -> Pages in <Politicians from the Netherlands> are in <People from the
> Netherlands by profession> (its first parent) and <People from the
> Netherlands> (its first parent's parent) and <Politicians> (its second
> parent) and <People> (its second parent's parent) and …
> * -> Yes, this poses issues given the sometimes cyclic nature of
> categories' hierarchies, but this is relatively trivial to code around

In the current structure, it doesn't make sense for Bob to be in a list
of people by profession, and it makes less sense the further you traverse
the category graph. OTOH, better querying capabilities may turn the
category system into more of a flat namespace, making that less of an
issue.
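
To make the "relatively trivial to code around" point a bit more concrete,
here is a minimal sketch of collecting a page's implicit categories by
walking parent links with a visited set, so that a cyclic hierarchy still
terminates. The PARENTS mapping, the function name, and the deliberately
cyclic edge are all made up for illustration; this is not how MediaWiki or
Wikidata actually stores or traverses categories.

```python
from collections import deque

# Hypothetical parent links (category -> its parent categories).
# The "People" -> "Politicians from the Netherlands" edge is a deliberate
# cycle, included only to show that the walk still terminates.
PARENTS = {
    "Politicians from the Netherlands": [
        "People from the Netherlands by profession",
        "Politicians",
    ],
    "People from the Netherlands by profession": ["People from the Netherlands"],
    "People from the Netherlands": ["People"],
    "Politicians": ["People"],
    "People": ["Politicians from the Netherlands"],
}


def implicit_categories(explicit, parents=PARENTS):
    """Return the explicit categories plus everything reachable above them."""
    seen = set(explicit)
    queue = deque(explicit)
    while queue:
        cat = queue.popleft()
        for parent in parents.get(cat, []):
            if parent not in seen:  # the visited check is what breaks cycles
                seen.add(parent)
                queue.append(parent)
    return seen
```

If distant ancestors turn out to be meaningless, as suspected above, a
depth limit on the walk would be an easy addition.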
>
> * Readers can search, querying across categories regardless of whether
> they're implicit or explicit
> * -> A search for the intersection of <People from the Netherlands> with
> <Politicians> will effectively return results for <Politicians from the
> Netherlands> (and the user doesn't need to know or care that this is an
> extant or non-extant category)

We would need some system to turn fake categories into real queries. I
suppose users could make redirects. The alternative, magic NLP, sounds
difficult.

> * -> Searches might be more than just intersections, e.g. "<Painters from
> the United Kingdom> AND <Living people> NOT <Members of the Royal Academy>"
> or whatever.
> * -> Such queries might be cached (and, indeed, the intersections that
> people search for might be used to suggest new categorisation schemata that
> wikis had previously not considered - e.g. <British politicians> & <People
> with pet cats> & <People who died in hot-ballooning accidents)

Dealing with cache invalidation (unless it is quite coarse-grained) may
be difficult.

>
> * Editors can tag articles with leaf or branch categories, potentially
> over-lapping and the system will rationalise the categories on save to the
> minimally-spanning subset (or whatever is most useful for users, the
> database, and/or both)

That's quite an interesting idea, and one I haven't heard in the previous
times this has been brought up.
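
One possible reading of "minimally-spanning subset" is: drop every tagged
category that is already implied by another tagged category. The sketch
below does only that. The PARENTS data and the function names are
hypothetical, and it assumes the hierarchy is acyclic (tagged categories
that sit on a cycle with each other would all be dropped).

```python
# Hypothetical parent links (category -> parent categories).
# Assumed acyclic for this sketch.
PARENTS = {
    "Politicians from the Netherlands": [
        "People from the Netherlands",
        "Politicians",
    ],
    "People from the Netherlands": ["People"],
    "Politicians": ["People"],
}


def ancestors(cat, parents=PARENTS):
    """All categories reachable above cat by following parent links."""
    seen = set()
    stack = [cat]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def rationalise(tagged, parents=PARENTS):
    """Keep only tagged categories not implied by another tagged category."""
    tagged = set(tagged)
    implied = set()
    for cat in tagged:
        # Exclude cat itself so a stray self-loop cannot remove it.
        implied |= ancestors(cat, parents) - {cat}
    return tagged - implied
```

With this data, tagging a page with <Politicians from the Netherlands>,
<Politicians> and <People> would save only the first of the three.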
One concern I'd have is how to figure out which categories to list at the
bottom of the page (all that could fit, or only the base categories, and
how to determine what those are).

> * -> Editors don't need to know the hierarchy of categories *a priori* when
> adding pages to them (yay, less difficulty)
> * -> Power editors don't need to type in loads of different categories if
> they have a very specific one in mind (yay, still flexible)
> * -> Categories shown to readers aren't necessarily the categories saved in
> the database, at editorial judgement (otherwise, would a page not be in
> just a single category, namely the intersection of all its tagged
> categories?)
>
> Apart from the time and resources needed to make this happen and
> operational, does this sound like something we'd want to do? It feels like
> this, or something like it, would serve our editors and readers the best
> from their perspective, if not our sysadmins. :-)
>
> [Snip]
> >
> > I think the best place to pursue this topic is probably in
> > https://meta.wikimedia.org/wiki/Talk:Beyond_categories . It's unlikely
> > Wikimedia Foundation will be able to make engineers available to work on
> > this anytime soon, but I would not be surprised if the Wikidata
> > developer community or volunteers found this interesting enough to work on.
> >
> I guess I should post this there too, maybe once someone's told me if it's
> mad-cap. ;-)
>

I think you have captured what a lot of people want in a somewhat dreamy
sense. However, there is still a lot to do to make that vision concrete.
In particular, I think there would be non-trivial UI challenges in making
this understandable to the user.

----

From what I hear, Wikidata phase 3 is basically going to be support for
inline queries. Details are vague, but if they support the typical types
of queries you associate with semantic networks, there is category
intersection right there.
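
For what it's worth, the AND/NOT searches discussed in this thread reduce
to plain set operations once category membership is available as sets.
The sketch below shows just that much. The page names, the MEMBERS data
and the query() helper are invented for illustration; a real system would
run this against an index or a query service, not in-memory sets.

```python
# Hypothetical data: category -> set of member pages (explicit or implicit).
MEMBERS = {
    "Painters from the United Kingdom": {"Alice", "Bob", "Carol"},
    "Living people": {"Alice", "Bob", "Dave"},
    "Members of the Royal Academy": {"Bob"},
}


def query(all_of, none_of=(), members=MEMBERS):
    """Pages in every category in all_of and in no category in none_of."""
    if not all_of:
        return set()
    # set.intersection returns a new set, so MEMBERS is never modified.
    result = set.intersection(*(members.get(c, set()) for c in all_of))
    for cat in none_of:
        result -= members.get(cat, set())
    return result


living_non_ra_painters = query(
    ["Painters from the United Kingdom", "Living people"],
    none_of=["Members of the Royal Academy"],
)
```

The hard parts (scale, caching, invalidation) are exactly what this
leaves out.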
If any of the Wikidata folk could comment on what sort of queries are
planned for phase 3, performance/scaling considerations, and technologies
being considered (triple store?), I'd be very interested in hearing. (I
recognize that future plans may not exist yet.)

More generally, it would be interesting to know the performance
characteristics of SPARQL-type query systems, since people seem to be
talking about them. Are they a non-starter, or could they be feasible?
Semantic and efficient are not words I associate with each other, but
that is due to rumour, not actual data. (Although my brief googling
doesn't exactly look promising.)

-bawolff
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l