https://bugzilla.wikimedia.org/show_bug.cgi?id=14237

--- Comment #10 from Philippe Verdy <verd...@wanadoo.fr> 2010-09-30 22:42:08 
UTC ---
note that multiple parameters for the syntax I propose may be reduced to just
one:
{{PAGESINCATEGORY:categoryname|restriction}}
where restriction may be:
- "" : no namespace id at all, useful to add namespaces
- "*" : all namespace ids (the default), useful to remove namespaces
followed by one or more of:
- "+id" : add this namespace id to the current list
- "-id" : remove this namespace id from the list

if the restriction does not start by "*" or "+" or "-", then "+" is implied
The namespace id could be either the numeric id, or a selector like "talk" to
select all talk namespaces, and "subject" to select all subject namespaces.

The namespace id can then take the forms:

- an integer, the raw namespace number

- a name, a namespace name (converted to a namespace id, should recognize the
synonyms, notably localized names or English names, or site-specific names)

- "odd": all odd namespace ids (i.e. "talk" namespaces associated to any
subject namespace)

- "even": all even namespace ids (i.e. "subject" namespaces)

For example:

- {{PAGESINCATEGORY:categoryname|*}} : equivalent to
{{PAGESINCATEGORY:categoryname|}} and to {{PAGESINCATEGORY:categoryname}}
(existing syntax)

- {{PAGESINCATEGORY:categoryname|:}} : count only pages of the main namespace,
that are members of the specified category name

- {{PAGESINCATEGORY:categoryname|0}} : count only pages of the main namespace,
that are members of the specified category name ; equivalent to
{{PAGESINCATEGORY:categoryname|+0}}

- {{PAGESINCATEGORY:categoryname|+project:+talk}} : count only pages of the
"project:" or of any talk namespaces, that are members of the specified
category name ; equivalent to {{PAGESINCATEGORY:categoryname|+0}}

- {{PAGESINCATEGORY:categoryname|-talk}} : count all pages of any namespace
excluding the talk namespaces (odd ids) that are members of the specified
category name; equivalent to {{PAGESINCATEGORY:categoryname|*-talk}}

The restriction can easily be implemented as WHERE clauses in the SQL select
that will match the specified namespace ids, combined as a parenthetic list of
'OR id=value' (positive selections), followed by a list of exclusions with 'AND
NOT id=value' (negative selections), and possibly with the "IN" operator if
sets are available in the SQL syntax.

Some ideas about the SQL server-side cost of counting members in a specific
category:

The SQL cost should with the restrictions above will be either the same (or
better) as performing a select without the namespace restriction (because this
is just a restriction of the existing syntax, and this should never reduce the
selectivity of the SQL query, but may in fact help to improve it).

However, this means that the existing restriction (for costly parser functions)
should remain (because counting pages that are members of a category,
independantly of which namespace they belong may be costly in very populated
categories, depending on how members of categories are indexed).

As this cost is effectively the cost of a:

 SELECT COUNT(*) from categorymembers
 WHERE category_pageid = $CATEGORYPAGEID
 AND member_namespaceid = $CATEGORYNAMESPACEID

aggregate (note: I don't know the exact schema impelementation which varies
across Mediawiki versions, so replace the table names and column names
appropriately), one way to solve it would be to use:

 SELECT 1 from categorymembers
 WHERE category_pageid = $CATEGORYPAGEID
 AND member_namespaceid = $CATEGORYNAMESPACEID
 LIMIT 50

and then let the PHP code count the returned "1" rows: if there are 50 rows,
then the category is too much populated, and COUNT(*) may take time, so the
function can be considered costly. If the cost limit is reached, just return
this limit value to the page calling the function, otherwise perform the same
select, replacing "SELECT 1" by "SELECT COUNT(*)" (without the LIMIT clause) to
return the exact value, or return the last known estimate from a separate
caching aggregate table that will be updated separately (using a max timestamp
of validity), to avoid reusing the same aggregate repetitively because of
templated pages using this function and frequent accesses by many users viewing
or editing various pages.

The value specified in the "LIMIT" clause above (here "50") may be tuned; and
this first check (for performance) may be removed completely, or removed if the
SQL schema includes an index that precompute aggregates for counting members in
each specific category (in which case there will not be any need to perform a
SELECT COUNT(*) aggregate, given that the count will be retrieved directly from
a precomputed aggregate caching table, that should be updated asynchronously,
either as a batch, or when the selective SELECT in the cache detects that the
stored value is out of date, in which case it will perform the SELECT COUNT(*)
from the non-cached table, just to update the caching table and its timestamp).

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
You are on the CC list for the bug.

_______________________________________________
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l

Reply via email to