On Sat, Nov 21, 2009 at 6:39 PM, Aryeh Gregor
<[email protected]> wrote:
> Selecting a list of all titles that are not
> redirects will take a long time on any database, unless you have
> everything in memory, because it requires a table scan -- there's no
> index that covers the relevant columns (IIRC).

You could build an index on page_is_redirect in the "page" table (see
enwiki-*-page.sql.gz).  But I'm pretty sure PostgreSQL wouldn't use
it, and would do a sequential scan instead, since pretty much all the
pages
are going to have to be accessed anyway.
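To make that concrete, here's a sketch of what I mean (table and
column names are from the dump's page.sql; I haven't actually run
this against enwiki):

```sql
-- A plain index on the flag: the planner will almost certainly
-- ignore it, because a query matching ~90% of the rows is cheaper
-- to answer with a sequential scan.
CREATE INDEX page_redirect_idx ON page (page_is_redirect);

-- A partial index on the titles of non-redirects is a better fit
-- for a lopsided boolean column like this one:
CREATE INDEX page_title_nonredirect_idx
    ON page (page_namespace, page_title)
    WHERE page_is_redirect = 0;

SELECT page_title
FROM page
WHERE page_is_redirect = 0 AND page_namespace = 0;
```

Even then, PostgreSQL (as of 8.4) still has to visit the heap for
every matching row to check visibility, so the sequential scan may
well win anyway.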

Five or ten minutes sounds about right.  I can't imagine this is a
query you want to run over and over again.  If it is, you'd probably
want to use partitioning
(http://www.postgresql.org/docs/8.4/interactive/ddl-partitioning.html),
but you're still not going to cut down the query time very much, as
it's going to be returning millions of rows.
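For reference, partitioning in 8.4 means table inheritance plus
CHECK constraints, roughly like this (again, just a sketch using the
dump's column names, and I doubt splitting on a two-value column is
worth the insert-routing overhead):

```sql
-- Child tables, one per value of the flag:
CREATE TABLE page_noredirect (
    CHECK (page_is_redirect = 0)
) INHERITS (page);

CREATE TABLE page_redirect (
    CHECK (page_is_redirect = 1)
) INHERITS (page);

-- With constraint exclusion on, the planner skips the child tables
-- whose CHECK constraint rules them out:
SET constraint_exclusion = on;
SELECT page_title FROM page WHERE page_is_redirect = 0;
```

That saves you scanning the redirect rows, but the result set is the
same millions of titles either way.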

> It's also worth pointing out that Wikipedia uses a version of MySQL
> with substantial modifications, and Wikimedia sysadmins are very
> familiar with its behavior.  Switching to a new technology might
> theoretically be better in the long term (although I wouldn't take
> that for granted in this case), but the transition cost would be
> substantial.  Heck, Wikipedia hasn't even upgraded to MySQL 4.1, let
> alone a whole different DBMS.

Yes, it can be very hard to switch your DBMS, and that's a very good
thing for MySQL.  :)

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
