On Mon, Oct 25, 2010 at 3:14 AM, Jan Høydahl / Cominvent <[email protected]> wrote:
> If I were you, I'd build a standalone program which interfaces your wiki
> (If wikipedia, perhaps download it at
> http://en.wikipedia.org/wiki/Wikipedia:Database_download), parses and feed
> to your index or whatever you need.

If your goal is to index or perform any kind of text analysis of mediawiki pages, I understand why you want to parse the page, since the markup tends to mess up text analysis.

If your goal is to look at wikipedia pages, I recommend downloading the Freebase Wikipedia Extraction (WEX) (http://download.freebase.com/wex/) instead of the wikipedia database download. If you download the articles (current latest at http://download.freebase.com/wex/latest/freebase-wex-2010-10-09-articles.tsv.bz2), one of the fields for each article is the text extracted from the wikipedia article. One of the original mediawiki developers wrote the text extractor for freebase, and if nothing else it does a better job of extracting text than the code I wrote a few years ago :-)

While I still do a little bit of cleanup before performing text analysis on WEX, that cleanup is nothing compared to what I had to do to try to get clean text out of mediawiki-formatted text.

Paul
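P.S. In case a concrete starting point helps, here's a rough Python sketch for pulling the extracted text out of the articles TSV. The column positions (title in the second column, plain text in the last) are my assumption for illustration only; check the actual WEX column layout before relying on this.

    #!/usr/bin/env python
    """Sketch: read the WEX articles dump and grab the plain-text field.

    Assumptions: bzip2-compressed TSV, one article per row, title in the
    second column and extracted text in the last column. Verify against
    the WEX documentation.
    """
    import bz2
    import re

    DUMP = "freebase-wex-2010-10-09-articles.tsv.bz2"

    def articles(path):
        """Yield (title, text) pairs; column positions are an assumption."""
        with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 2:
                    continue  # skip malformed rows
                yield fields[1], fields[-1]  # assumed title / text columns

    def cleanup(text):
        """The small amount of cleanup I do: collapse runs of whitespace."""
        return re.sub(r"\s+", " ", text).strip()

    if __name__ == "__main__":
        for title, text in articles(DUMP):
            print(title, "->", cleanup(text)[:80])
            break  # just peek at the first article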
