On 11/29/2011 11:18 PM, Platonides wrote:
> It might be faster to process the sql file
> with string tools (unlikely, having to match pageids with namespaces)
> but using a user level mysqld for that is completely overkill.

I managed to do this in 120 lines of Perl. First I parse page.sql,
where page_id and page_namespace are the first two columns, setting
up a bitmap to indicate which page_ids are in the relevant
namespaces (article, File, Author, Creator, Index, Museum, Portal).

Then I parse externallinks.sql, where el_from (= page_id) and el_to
(= the URL) are the first two columns. I check if el_from is in my
bitmap and cut out the domain name from the URL, counting its
occurrences in a hash. Finally, I sort the hash keys to get the
output in alphabetical order, and print only those domains that have
10 links or more (the long tail of less popular domains is of no
interest to me).
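
The second pass could be sketched in Python along these lines (again,
not the Perl script itself). It assumes el_from and el_to are the first
two columns of each row, as described above, and that el_to is a
single-quoted string with backslash escapes. How the domain is cut out
of the URL is a guess: this version takes the host name and drops a
leading "www.". The function names and the `is_wanted` callback, which
stands in for the bitmap lookup, are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Start of each row tuple: (el_from,'el_to',...
LINK_ROW = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)'")


def domain_of(url):
    """Return the host name of a URL without a leading 'www.',
    or an empty string if the URL cannot be parsed."""
    try:
        host = urlsplit(url).hostname or ""
    except ValueError:
        return ""
    return host[4:] if host.startswith("www.") else host


def count_domains(lines, is_wanted):
    """Count links per domain for rows whose el_from passes is_wanted."""
    counts = Counter()
    for line in lines:
        if not line.startswith("INSERT INTO"):
            continue  # skip schema definitions and comments
        for m in LINK_ROW.finditer(line):
            if is_wanted(int(m.group(1))):
                domain = domain_of(m.group(2))
                if domain:
                    counts[domain] += 1
    return counts


def report(counts, min_links=10):
    """Format domains with at least min_links links, sorted by name."""
    return [f"{n:8d} {d}" for d, n in sorted(counts.items())
            if n >= min_links]
```

Only the per-domain counters are kept in memory, so the footprint
depends on the number of distinct domains rather than on the number
of links.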

For both commonswiki and enwiki, this runs in 12 minutes and the
RAM footprint on wolfsbane stays under 250 MB. It takes far longer
to download the database dumps to the toolserver. It would make
sense to run this on dumps.wikimedia.org, as part of the dump
process.

For the wikis I have tested, the output looks very similar to
what I got from querying the replicated database, except that
all external links to WMF sites seem to have been removed from
the SQL dumps.

Based on the database dump of frwiki-20111123, I got:

  350570 toolserver.org
   90505 culture.gouv.fr
   52081 legifrance.gouv.fr
   51837 imdb.fr
   50189 akas.imdb.com
   46754 books.google.fr
   38654 ncbi.nlm.nih.gov
   38184 recensement.insee.fr
   36028 catalogueoflife.org
   35382 insee.fr

Based on the live, replicated frwiki_p database:

  352260 toolserver.org
  101619 commons.wikimedia.org
   90281 culture.gouv.fr
   82110 en.wikipedia.org
   52161 legifrance.gouv.fr
   52026 imdb.fr
   50379 akas.imdb.com
   46860 books.google.fr
   38715 ncbi.nlm.nih.gov
   38197 recensement.insee.fr


-- 
   Lars Aronsson ([email protected])
   Aronsson Datateknik - http://aronsson.se



_______________________________________________
Toolserver-l mailing list ([email protected])
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: 
https://wiki.toolserver.org/view/Mailing_list_etiquette
