Here's an Apache2 licensed parser: http://code.google.com/p/wikimodel/source/browse/trunk/org.wikimodel.wem/src/main/java/org/wikimodel/wem/mediawiki/
But I also think there's probably no need to involve Tika, unless you have a file system with tons of different files, some of which are plain-text wiki files. It is hard to detect the type of a MediaWiki file, as I don't think they have a standard filename suffix or magic byte sequence. You'd need to scan for parts of the markup. If I were you, I'd build a standalone program which interfaces with your wiki (if Wikipedia, perhaps download it at http://en.wikipedia.org/wiki/Wikipedia:Database_download), parses the pages, and feeds them to your index or whatever you need.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 25 Oct 2010, at 09:27, Ista Pouss wrote:

> Hi,
>
> I thank Mattmann Chris and Paul Jakubik.
>
> There is no official spec of the markup language. There are some
> parsers... I found "Wiki2HtmlJavaProgram"
> (http://community.jboss.org/wiki/Wiki2HtmlJavaProgram) and "jwpl"
> (http://code.google.com/p/jwpl/). Perhaps it's best to start from
> scratch with antlr?
>
> My use case is to catch and analyse Wikipedia pages on the theme of natural life.
>
> I think there is no MIME type for a Wikipedia source code page. M.
> Mattmann says if "someone... wants to throw out there a best practice
> on the MIME spec"... what is that?
>
> Perhaps I'm going to start My Little Tika Project on that, but I'm
> afraid I can't do all the mess and stuff and foo about mediawiki
> markup...
>
> Thanks.
>
>
> 2010/10/24 Mattmann, Chris A (388J) <[email protected]>:
>> Hi Guys,
>>
>>> [...]
>>> Until there is a complete spec for parsing media wiki markup, or a java
>>> library that does a good job of extracting text from documents formatted
>>> with
>>> media wiki markup, I don't think extracting text from media wiki markup
>>> documents is in scope for Tika.
>>
>> I'd disagree with that. We never have complete specs for *many* of the
>> existing formats we tackle in Tika, and there are exceptions and bugs and
>> platform-specific things that are found all the time that require
>> accommodations.
>>
>> I'd say if someone can find a parsing library for Media-wiki format, and
>> wants to throw out there a best practice on the MIME spec, or if someone was
>> even willing to roll their own parsing library, I'd welcome the
>> contribution.
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
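
PS: on scanning for parts of the markup when there is no filename suffix or magic byte sequence to rely on, here is a rough sketch of what such a check could look like in plain Java. The class name WikiMarkupSniffer, the list of patterns, and the threshold of two distinct hits are all assumptions made up for illustration; none of this comes from Tika or WikiModel, and a heuristic like this can misfire on other text formats.

```java
import java.util.regex.Pattern;

/** Hypothetical heuristic: guess whether a text looks like MediaWiki markup. */
public class WikiMarkupSniffer {

    // Assumed set of common wikitext constructs; extend or tune as needed.
    private static final Pattern[] HINTS = {
        Pattern.compile("\\[\\[[^\\[\\]\\n]+\\]\\]"),            // [[Internal link]]
        Pattern.compile("\\{\\{[^{}\\n]+\\}\\}"),                // {{Template}}
        Pattern.compile("(?m)^={2,6}[^=\\n].*?={2,6}\\s*$"),     // == Heading ==
        Pattern.compile("'''[^'\\n]+'''"),                       // '''bold'''
        Pattern.compile("(?i)<ref[ >/]")                         // <ref> footnotes
    };

    /** Counts how many distinct kinds of construct occur at least once. */
    public static int score(String text) {
        int hits = 0;
        for (Pattern p : HINTS) {
            if (p.matcher(text).find()) {
                hits++;
            }
        }
        return hits;
    }

    /** Arbitrary threshold (an assumption): two different constructs. */
    public static boolean looksLikeMediaWiki(String text) {
        return score(text) >= 2;
    }

    public static void main(String[] args) {
        String sample = "== Habitat ==\nThe '''red fox''' lives in [[Europe]].";
        System.out.println(looksLikeMediaWiki(sample));
        System.out.println(looksLikeMediaWiki("Just a plain sentence."));
    }
}
```

Requiring two different constructs rather than one is only meant to cut down on false positives from stray brackets or quotes in ordinary text; in practice you would run it on just the first few kilobytes of each file.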

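PPS: the standalone-program idea could start out roughly like this: stream a Wikipedia XML dump with the JDK's StAX API and hand each page's title and raw wikitext to a callback. The class name WikiDumpReader and the callback standing in for "your index" are hypothetical. The sketch assumes the usual dump layout, where each page element holds a title and a revision with a text element, and it does not parse the wikitext itself; that part would be the job of a parser such as the one linked at the top.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: stream pages out of a MediaWiki XML dump. */
public class WikiDumpReader {

    /**
     * Reads the dump and passes (title, wikitext) of every text element
     * to the sink. Returns the number of text elements seen.
     */
    public static int readPages(Reader dump, BiConsumer<String, String> sink) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader xml = factory.createXMLStreamReader(dump);
            String title = null;
            int count = 0;
            while (xml.hasNext()) {
                if (xml.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = xml.getLocalName();
                if ("title".equals(name)) {
                    title = xml.getElementText();      // remember the current page title
                } else if ("text".equals(name)) {
                    sink.accept(title, xml.getElementText());
                    count++;
                }
            }
            xml.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed dump XML", e);
        }
    }

    public static void main(String[] args) {
        String tiny = "<mediawiki><page><title>Fox</title><revision>"
                + "<text>'''Foxes''' are [[mammal]]s.</text>"
                + "</revision></page></mediawiki>";
        // The lambda is where a call to your indexer would go.
        readPages(new StringReader(tiny),
                (title, text) -> System.out.println(title + ": " + text));
    }
}
```

Streaming matters here because the full dumps are far too large to load into memory as a DOM tree; for a real dump you would pass a reader over the (decompressed) file instead of the StringReader.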