Paul Houle wrote:
> I did a substantial project that worked from the XML dumps. I
> designed a recursive descent parser in C# that, with a few tricks,
> almost decodes wikipedia markup correctly. Getting it right is tricky,
> for a number of reasons, however, my approach preserved some
> semantics that would have been lost in the HTML dumps.
> (...)
> In your case, I'd do the following: install a copy of the
> mediawiki software,
>
> http://lifehacker.com/#!163707/geek-to-live--set-up-your-personal-wikipedia
>
> get a list of all the pages in the wiki by running a database
> query, and then write a script that does http requests for all the
> pages and saves them in files. This is programming of the simplest
> type, but getting good speed could be a challenge. I'd seriously
> consider using Amazon EC2 for this kind of thing, renting a big DB
> server and a big web server, then writing a script that does the
> download in parallel.
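To make that last step concrete, the "script that does http requests for all the pages" can be very small. A rough sketch in Python, assuming the local install answers at http://localhost/index.php, the database query has been saved to a titles.txt with one page title per line, and the requests library is installed (all of these are placeholders, not anything Paul specified):

# Fetch every page from a local MediaWiki over HTTP and save it to a file.
import os
import concurrent.futures

import requests

BASE_URL = "http://localhost/index.php"   # hypothetical local install
OUT_DIR = "pages"
os.makedirs(OUT_DIR, exist_ok=True)

def fetch(title):
    # Render the page through the local wiki and write the HTML to disk.
    resp = requests.get(BASE_URL, params={"title": title}, timeout=60)
    resp.raise_for_status()
    safe_name = title.replace("/", "_")
    path = os.path.join(OUT_DIR, safe_name + ".html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(resp.text)
    return title

with open("titles.txt", encoding="utf-8") as f:
    titles = [line.strip() for line in f if line.strip()]

# Parallel download, as Paul suggests; tune max_workers to whatever the
# rented web and DB servers can actually sustain.
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    for _ in pool.map(fetch, titles):
        pass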
He could just as well generate the static HTML dumps from that:
http://www.mediawiki.org/wiki/Extension:DumpHTML

I think he is better off parsing the articles, though. For linguistic research you don't need things such as the contents of templates, so a simple wikitext stripping would do, something like the sketch below. And it will be much, much faster than parsing the whole wiki.
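By "simple wikitext stripping" I mean something on this order (a crude sketch, assuming the article text has already been pulled out of the <text> elements of the XML dump; it is a handful of regex passes, not a real parser, so nested templates and exotic markup will slip through):

# Crude wikitext stripping, enough for word-level linguistic work.
import re

def strip_wikitext(text):
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)   # drop templates {{...}}
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)   # second pass for one level of nesting
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"\[https?://\S+ ([^\]]*)\]", r"\1", text)       # external links -> label
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)  # drop footnotes
    text = re.sub(r"<[^>]+>", "", text)          # drop remaining HTML tags
    text = re.sub(r"'{2,}", "", text)            # bold/italic quote markup
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.MULTILINE)  # headings
    return text

The output of strip_wikitext() can then go straight into whatever tokenizer the linguistic work uses, with no MediaWiki install or HTML rendering in the loop.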