Hi, and thanks to Chris Mattmann and Paul Jakubik.
There is no official spec for the markup language. There are some parsers: I found "Wiki2HtmlJavaProgram" (http://community.jboss.org/wiki/Wiki2HtmlJavaProgram) and "jwpl" (http://code.google.com/p/jwpl/). Perhaps it would be best to start from scratch with ANTLR?

My use case is to fetch and analyze Wikipedia pages on natural-life topics. I don't think there is a MIME type for Wikipedia page source. Mr. Mattmann mentions that "someone... wants to throw out there a best practice on the MIME spec"... what does that mean?

Perhaps I'll start my own little Tika project on this, but I'm afraid I can't handle all the mess of MediaWiki markup on my own...

Thanks.

2010/10/24 Mattmann, Chris A (388J) <[email protected]>:
> Hi Guys,
>
>> [...]
>> Until there is a complete spec for parsing media wiki markup, or a java
>> library that does a good job of extracting text from documents formatted with
>> media wiki markup, I don't think extracting text from media wiki markup
>> documents is in scope for Tika.
>
> I'd disagree with that. We never have complete specs for *many* of the
> existing formats we tackle in Tika, and there are exceptions and bugs and
> platform-specific things that are found all the time that require
> accommodations.
>
> I'd say if someone can find a parsing library for Media-wiki format, and
> wants to throw out there a best practice on the MIME spec, or if someone was
> even willing to roll their own parsing library, I'd welcome the
> contribution.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
