Re: Tika for mediawiki ?

Paul Jakubik Sun, 24 Oct 2010 08:01:18 -0700

On Sun, Oct 24, 2010 at 8:32 AM, Ista Pouss <[email protected]> wrote:


> 2010/10/24 Jukka Zitting <[email protected]>:
> >
> > No. MediaWiki uses a database backend instead of a special file format
> > for storing data, so you'd need to use something like the ManifoldCF
> > (http://incubator.apache.org/connectors/) to extract information from
> > a MediaWiki installation.
> >
>
> Yes, but it's also possible to use the media wiki API
> (http://www.mediawiki.org/wiki/API) and read json, yaml, xml etc
> format. It's also possible to read the mediawiki code of a simple page
> (http://en.wikipedia.org/wiki/Lucene?action=raw, to get the mediawiki
> source of Lucene page). Is it possible to make an extractor with that,
> or is it best to do with Manifold ?
>
>
I think using the MediaWiki API is outside the scope of Tika.

Parsing MediaWiki markup and extracting text from a document formatted
with it is potentially inside the scope of Tika. Such text extraction would
not load templates or other content that the markup refers to and that
would normally appear as part of the rendered page, but it could extract
the text present in a raw MediaWiki-formatted document. Even if this
limitation is acceptable, I think we would also have to answer the
following questions before MediaWiki parsing could be added to Tika:

   - How would Tika know that it needs to perform MediaWiki markup parsing?
   Is there a MIME type for that?
   - Is there a parsing library available for extracting text from
   MediaWiki markup documents?
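
On the first question, if no declared type is available, one fallback would
be content sniffing. The following is only a rough sketch of that idea in
Python, not anything Tika does today. The pattern list, the function name
looks_like_wiki_markup, and the min_hints threshold are all assumptions made
up for illustration.

```python
import re

# Hypothetical heuristic: these patterns and the threshold are assumptions,
# not an established way of identifying MediaWiki markup.
WIKI_HINTS = [
    re.compile(r"\[\[[^\]]+\]\]"),                      # [[internal link]]
    re.compile(r"\{\{[^}]+\}\}"),                       # {{template}}
    re.compile(r"^=+[^=\n]+=+[ \t]*$", re.MULTILINE),   # == heading ==
    re.compile(r"'''[^'\n]+'''"),                       # '''bold'''
]


def looks_like_wiki_markup(text, min_hints=2):
    """Return True if at least min_hints distinct markup patterns occur."""
    found = sum(1 for pattern in WIKI_HINTS if pattern.search(text))
    return found >= min_hints


print(looks_like_wiki_markup("== Title ==\nSee [[Lucene]]."))
print(looks_like_wiki_markup("Plain text with no markup."))
```

A heuristic like this can misfire on plain text that happens to contain
brackets or equals signs, which is part of why a declared type would be
preferable.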

I think the second item is the more difficult part of extracting text from
MediaWiki markup. While many uses of MediaWiki markup are simple and
straightforward, the entire markup language is not. As far as I can tell,
there is still no spec for the language, though this page appears to be the
current best attempt at one: http://www.mediawiki.org/wiki/Markup_spec
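
To illustrate the gap between the simple cases and the full language, here
is a minimal sketch in Python of a naive text extractor. It is not a real
parser and not a proposal for Tika. The function name
strip_simple_wiki_markup and the handful of constructs it handles are
assumptions chosen for illustration. It drops flat templates rather than
expanding them, and it does not handle nested templates, tables, or
parser functions.

```python
import re


def strip_simple_wiki_markup(text):
    """Naively strip a small subset of MediaWiki markup (illustrative only)."""
    # Drop non-nested templates such as {{stub}}; they are NOT expanded.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # '''bold''' and ''italic'' -> plain text
    text = re.sub(r"'{2,}", "", text)
    # == Heading == -> Heading
    text = re.sub(r"^=+\s*(.*?)\s*=+[ \t]*$", r"\1", text, flags=re.MULTILINE)
    return text.strip()


sample = (
    "== Overview ==\n"
    "'''Lucene''' is a [[search engine|search]] library. {{stub}}"
)
print(strip_simple_wiki_markup(sample))

# A nested template is only partly removed, so markup leaks into the output.
print(strip_simple_wiki_markup("{{a|{{b}}}} text"))
```

The second call shows the problem: a single regular-expression pass removes
the inner template but leaves the outer braces behind, so even this small
case already needs a real parser.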

Until there is a complete spec for parsing MediaWiki markup, or a Java
library that does a good job of extracting text from documents formatted
with MediaWiki markup, I don't think extracting text from MediaWiki markup
documents is in scope for Tika.

Paul
