Paul Houle wrote:
>      I did a substantial project that worked from the XML dumps. I
> designed a recursive-descent parser in C# that, with a few tricks,
> almost decodes Wikipedia markup correctly. Getting it right is tricky
> for a number of reasons; however, my approach preserved some semantics
> that would have been lost in the HTML dumps.
> 
(...)
>      In your case, I'd do the following: install a copy of the
> MediaWiki software,
> 
> http://lifehacker.com/#!163707/geek-to-live--set-up-your-personal-wikipedia 
> 
>      get a list of all the pages in the wiki by running a database
> query, and then write a script that does HTTP requests for all the
> pages and saves them in files. This is programming of the simplest
> type, but getting good speed could be a challenge. I'd seriously
> consider using Amazon EC2 for this kind of thing, renting a big DB
> server and a big web server, then writing a script that does the
> downloads in parallel.
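
For the list-and-fetch part, something along these lines would do. It is
only a sketch: the base URL, database credentials and output directory are
placeholders, and it assumes the stock MediaWiki schema (the page table
with page_title and page_namespace).

# List pages from the MediaWiki database, then fetch and save them in parallel.
# Untested sketch: the host, credentials and paths below are placeholders.
import os
import pymysql
import requests
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost/wiki/index.php"   # placeholder for the local install
OUT_DIR = "pages"

def list_article_titles():
    """Read article titles straight from the standard MediaWiki page table."""
    conn = pymysql.connect(host="localhost", user="wiki",
                           password="secret", db="wikidb")
    with conn.cursor() as cur:
        cur.execute("SELECT page_title FROM page WHERE page_namespace = 0")
        return [t.decode() if isinstance(t, bytes) else t
                for (t,) in cur.fetchall()]

def fetch(title):
    """Fetch one rendered page and save it to a file."""
    resp = requests.get(BASE_URL, params={"title": title}, timeout=30)
    resp.raise_for_status()
    path = os.path.join(OUT_DIR, title.replace("/", "_") + ".html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(resp.text)

if __name__ == "__main__":
    os.makedirs(OUT_DIR, exist_ok=True)
    # Parallel download; tune the worker count to what the web server can take.
    with ThreadPoolExecutor(max_workers=16) as pool:
        pool.map(fetch, list_article_titles())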

He could also generate static HTML dumps from that install:
http://www.mediawiki.org/wiki/Extension:DumpHTML

I think he is better off parsing the articles, though.

For linguistic research you don't need things such as the contents of
templates, so simple wikitext stripping would do. And it will be much,
much faster than parsing the whole wiki.

