On Wed, Mar 31, 2010 at 9:27 PM, Mashiah Davidson <[email protected]> wrote:
>> * Build a graph of Wikipedia articles in the main namespace, with
>> wikilinks as vertexes. Since some pages are not reachable from other
>> pages, this is actually N disconnected graphs.
>> * Remove all edges which refer to disambiguation pages, date pages, or
>> lists
>> * Remove the graph which contains the main page
>> * Produce a list of all remaining graphs.
>>
>> Is that roughly correct?
>
> It is roughly correct description of one of Golem's processing stages.
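As a side note on the edge-filtering step: the disambiguation pages, at
least, can be picked up via categorylinks. Purely as an illustration
(on dewiki the category is "Begriffsklärung"; other wikis, and the date
pages and lists, would need their own categories or title patterns):

mysql> SELECT page_id   /* pages whose edges you would then drop */
    FROM dewiki_p.page
    JOIN dewiki_p.categorylinks ON cl_from = page_id
    WHERE page_namespace = 0
    AND cl_to = 'Begriffsklärung' ;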
Well, I just did the experiment for German Wikipedia, using page_id
pairs in a temporary, non-memory table. In my user database (on the
same server as dewiki_p):

mysql> create temporary table delinks ( pid1 INTEGER , pid2 INTEGER ) ENGINE=InnoDB ;

mysql> INSERT /* SLOW_OK */ INTO delinks ( pid1, pid2 )
    select p1.page_id AS pid1, p2.page_id AS pid2
    from dewiki_p.page AS p1, dewiki_p.page AS p2, dewiki_p.pagelinks
    WHERE pl_title=p2.page_title and p2.page_namespace=0
    and pl_namespace=0 and p1.page_id=pl_from and p1.page_namespace=0 ;
Query OK, 34964160 rows affected (32 min 59.29 sec)
Records: 34964160  Duplicates: 0  Warnings: 0

So, 35 million link pairs between namespace-0 pages, created in 33
minutes (~1 million links per minute). That's not too bad for our #2
wikipedia, and seems perfectly manageable.

Depending on your usage, now add indices and spices :-)

Magnus
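PS: By "indices" I mean something along these lines; this is only a
sketch, and which columns you index depends on whether you look the
pairs up by source page, by target page, or both:

mysql> ALTER TABLE delinks
    ADD INDEX idx_pid1 (pid1),   /* lookups by source page_id */
    ADD INDEX idx_pid2 (pid2);   /* lookups by target page_id */

(The index names idx_pid1/idx_pid2 are just placeholders.)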
