On Sun, Oct 24, 2010 at 8:32 AM, Ista Pouss <[email protected]> wrote:
> 2010/10/24 Jukka Zitting <[email protected]>:
> >
> > No. MediaWiki uses a database backend instead of a special file format
> > for storing data, so you'd need to use something like the ManifoldCF
> > (http://incubator.apache.org/connectors/) to extract information from
> > a MediaWiki installation.
> >
>
> Yes, but it's also possible to use the media wiki API
> (http://www.mediawiki.org/wiki/API) and read json, yaml, xml etc
> format. It's also possible to read the mediawiki code of a simple page
> (http://en.wikipedia.org/wiki/Lucene?action=raw, to get the mediawiki
> source of Lucene page). Is it possible to make an extractor with that,
> or is it best to do with Manifold ?
>

I think using the MediaWiki API is outside the scope of Tika.

Parsing MediaWiki markup, and extracting text from a document that is
formatted using MediaWiki markup, is potentially inside the scope of Tika.
This text extraction would not load templates or other content that the
markup refers to and that would normally appear as part of the page, but it
could extract the text present in a raw MediaWiki-formatted document.

Even if this limitation is acceptable, I think we would also have to answer
the following questions before MediaWiki parsing could be added to Tika:

- How would Tika know that it needs to perform MediaWiki markup parsing?
  Is there a MIME type for that?
- Is there a parsing library available for extracting text from MediaWiki
  markup documents?

I think the second item is the more difficult part of extracting text from
MediaWiki markup. While many uses of MediaWiki markup are simple and
straightforward, the entire markup language is not. As far as I can tell,
there is still no spec for the language, though this page appears to be the
current best attempt at one:

http://www.mediawiki.org/wiki/Markup_spec

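
To make the "simple and straightforward" cases concrete, here is a rough Java
sketch of the kind of extraction I have in mind. Everything in it is an
assumption for illustration: the class WikiTextSketch and its strip method are
hypothetical and not part of Tika or any existing library, and the regular
expressions cover only bold/italic quotes, headings, internal links, and
templates. Templates are dropped, not expanded, which is the limitation
described above.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not a Tika parser and not a full wiki parser.
public class WikiTextSketch {
    // An innermost template such as {{citation needed}} (no nested braces inside).
    private static final Pattern TEMPLATE = Pattern.compile("\\{\\{[^{}]*\\}\\}");
    // [[Target|label]] keeps "label".
    private static final Pattern PIPED_LINK =
            Pattern.compile("\\[\\[[^\\]|]*\\|([^\\]]*)\\]\\]");
    // [[Target]] keeps "Target".
    private static final Pattern PLAIN_LINK = Pattern.compile("\\[\\[([^\\]|]*)\\]\\]");
    // '' (italic), ''' (bold), ''''' (bold italic) markers.
    private static final Pattern QUOTES = Pattern.compile("'{2,}");
    // == Heading == on a line of its own keeps "Heading".
    private static final Pattern HEADING =
            Pattern.compile("(?m)^=+[ \\t]*(.*?)[ \\t]*=+[ \\t]*$");

    public static String strip(String markup) {
        String text = markup;
        String previous;
        // Remove innermost templates repeatedly so nested templates disappear too.
        do {
            previous = text;
            text = TEMPLATE.matcher(text).replaceAll("");
        } while (!text.equals(previous));
        text = PIPED_LINK.matcher(text).replaceAll("$1");
        text = PLAIN_LINK.matcher(text).replaceAll("$1");
        text = QUOTES.matcher(text).replaceAll("");
        text = HEADING.matcher(text).replaceAll("$1");
        return text.trim();
    }

    public static void main(String[] args) {
        String raw = "== Overview ==\n"
                + "'''Lucene''' is a [[Search engine|search library]].{{citation needed}}";
        System.out.println(strip(raw));
    }
}
```

Tables, parser functions, HTML tags, references, and magic words are all
ignored by this sketch, which is exactly why a real parsing library or a
complete spec would be needed.
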
Until there is a complete spec for parsing MediaWiki markup, or a Java
library that does a good job of extracting text from documents formatted
with MediaWiki markup, I don't think extracting text from MediaWiki markup
documents is in scope for Tika.

Paul
