On Mon, Oct 25, 2010 at 3:14 AM, Jan Høydahl / Cominvent <[email protected]> wrote:
> If I were you, I'd build a standalone program which interfaces your wiki
> (If wikipedia, perhaps download it at
> http://en.wikipedia.org/wiki/Wikipedia:Database_download), parses and feed
> to your index or whatever you need.

If your goal is to index or perform any kind of text analysis of mediawiki pages, I understand why you want to parse the page, since the markup tends to mess up text analysis.

If your goal is to look at wikipedia pages, I recommend downloading the Freebase Wikipedia Extraction (WEX) (http://download.freebase.com/wex/) instead of the wikipedia database download. If you download the articles (current latest at http://download.freebase.com/wex/latest/freebase-wex-2010-10-09-articles.tsv.bz2), one of the fields for each article is the text extracted from the wikipedia article. One of the original mediawiki developers wrote the text extractor for freebase, and if nothing else it does a better job of extracting text than the code I wrote a few years ago :-)

While I still do a little bit of cleanup before performing text analysis on WEX, that cleanup is nothing compared to what I had to do to try to get clean text out of mediawiki-formatted text.

Paul
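P.S. In case a concrete starting point helps, here's a rough Python sketch for pulling the extracted text out of the articles TSV. The column positions (title in the second column, plain text in the last) are my assumption for illustration only; check the actual WEX column layout before relying on this.

    #!/usr/bin/env python
    """Sketch: read the WEX articles dump and grab the plain-text field.

    Assumptions: bzip2-compressed TSV, one article per row, title in the
    second column and extracted text in the last column. Verify against
    the WEX documentation.
    """
    import bz2
    import re

    DUMP = "freebase-wex-2010-10-09-articles.tsv.bz2"

    def articles(path):
        """Yield (title, text) pairs; column positions are an assumption."""
        with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 2:
                    continue  # skip malformed rows
                yield fields[1], fields[-1]  # assumed title / text columns

    def cleanup(text):
        """The small amount of cleanup I do: collapse runs of whitespace."""
        return re.sub(r"\s+", " ", text).strip()

    if __name__ == "__main__":
        for title, text in articles(DUMP):
            print(title, "->", cleanup(text)[:80])
            break  # just peek at the first article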
