Re: [Wikitech-l] The never-dying topic: category intersection

Magnus Manske Tue, 02 Dec 2008 08:40:31 -0800

On Tue, Dec 2, 2008 at 2:57 PM, Aryeh Gregor
<[EMAIL PROTECTED]> wrote:
> On Tue, Dec 2, 2008 at 7:01 AM, Magnus Manske
> <[EMAIL PROTECTED]> wrote:
>> (feel free to bash me if we had this variant already, I couldn't find
>> it in the list archives)
>>
>> Task: On German Wikipedia (yay atomic categories!), find women who
>> were born in 1901 and died in 1986.
>> Runtime : Toolserver, <2 sec
>> Query:
>> SELECT * FROM ( SELECT page_title,count(cl_to) AS cnt FROM
>> page,categorylinks WHERE page_id=cl_from AND cl_to in ( "Frau" ,
>> "Geboren_1901" , "Gestorben_1986" ) GROUP BY cl_from ) AS tbl1 WHERE
>> tbl1.cnt = 3 ;
>
> This will fail with a syntax error on the main servers, because
> subqueries aren't supported in MySQL 4.0.  You don't really need the
> subquery, though; you should be able to just use HAVING:
>
> SELECT page_title FROM page, categorylinks WHERE page_id=cl_from AND
> cl_to in ( 'Frau', 'Geboren_1901' , 'Gestorben_1986' ) GROUP BY
> cl_from HAVING COUNT(cl_to) = 3;


Your're right. Fixed in the tool.

> Your solution requires filesorting the union of the categories, as far
> as I can tell.  I would expect it, offhand, to be significantly slower
> than a solution using joins:
>
> SELECT page_title FROM page JOIN categorylinks AS cl1 ON
> page_id=cl1.cl_from JOIN categorylinks AS cl2 ON page_id=cl2.cl_from
> JOIN categorylinks AS cl3 ON page_id=cl3.cl_from WHERE
> cl1.cl_to='Frau' AND cl2.cl_to='Geboren_1901' AND cl3.cl_to =
> 'Gestorben_1986';
>
> But I haven't benchmarked it, and who knows what kind of execution
> quirks are happening here.

It seems the JOIN query is significantly faster when all categories are large.

However, with one or more small categories, I can do a pre-selection
of pages (get page_ids for the intersection of the small categories,
then look only for these in the larger ones), which in turn is
significantly faster than the JOIN.
My tool now uses the algorithm appropriate for the respective query.

>> Trying to "poison" the query by also looking in all GFDL images
>> ("GFDL-Bild", ~60K entries in category) increases runtime to 3 sec.,
>> so not that bad.
>
> 3 seconds is a very long time for a query to run.  Typical queries
> should take more like, say, 10 ms.  Occasional selects taking three
> seconds might or might not kill the servers, but they're far from
> optimal.

I am uncertain how much the toolserver factors in here. The poor thing
is under a lot of stress ;-)

> Also, did you try in a really worst-case scenario, like
> intersecting "Unprintworthy redirects" with "Stub-Class biography
> articles" on enwiki?  Obviously users aren't likely to legitimately
> run an intersection of those exact categories (since they're logically
> disjoint), but you should test this kind of thing to ensure
> scalability.  The query appears to take 16s on your tool.

I ran it again now, and it falls back to the JOIN solution, taking ~10
sec. As a worst-case scenario, I call that acceptable for the tool.

It might not be acceptable for Wikipedia ATM. We could experiment how
this performs on the "real" servers, though.

Also, we could restrict certain queries. We know the category size,
and in my approach, we know how many articles are in the "small
category" intersection. Form there, we could guesstimate the
worst-case time, and kill the query, or run it in MySQL slow mode
(forgot the correct name) to not stress the servers too much.

> Again, the only really scalable solution looks to be fulltext search
> of some kind.  We've known for a long time that category intersections
> can easily be done well enough, for a modest standard of "well
> enough", but that hasn't been considered good enough to run on
> Wikipedia.

No matter what method, I think the problem should get high priority. I
currently see a case on Commons, where there's now "Category:Paintings
by Vincent van Gogh in this-and-that-museum". It's getting ridiculous
(or is already there).

Magnus

P.S.: Just got a message on my Commons talk page about the
"+incategory:" search function.
* This is based on the Lucene index, right? How often is that updated?
* Is there a decent interface/special page for that? It's a pain to
enter this manually, and I doubt many people know about it
* Is there a machine-readable interface for this? One that will return
5K hits without screenscraping?

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] The never-dying topic: category intersection

Reply via email to