Hi,

Thanks to Chris Mattmann and Paul Jakubik.

There is no official spec for the MediaWiki markup language. There are
some parsers, though: I found "Wiki2HtmlJavaProgram"
(http://community.jboss.org/wiki/Wiki2HtmlJavaProgram) and "jwpl"
(http://code.google.com/p/jwpl/). Perhaps it's best to start from
scratch with ANTLR?

My use case is to fetch and analyse Wikipedia pages on the theme of natural life.

I think there is no MIME type for the source of a Wikipedia page. Mr.
Mattmann mentions "someone... wants to throw out there a best practice
on the MIME spec": what does that mean?

Perhaps I'll start my own little Tika project on that, but I'm
afraid I can't handle all the mess and quirks of MediaWiki
markup...

Thanks.


2010/10/24 Mattmann, Chris A (388J) <[email protected]>:
> Hi Guys,
>
>> [...]
>> Until there is a complete spec for parsing media wiki markup, or a java
>> library that does a good job of extracting text from documents formatted with
>> media wiki markup, I don't think extracting text from media wiki markup
>> documents is in scope for Tika.
>
> I'd disagree with that. We never have complete specs for *many* of the
> existing formats we tackle in Tika, and there are exceptions and bugs and
> platform-specific things that are found all the time that require
> accommodations.
>
> I'd say if someone can find a parsing library for Media-wiki format, and
> wants to throw out there a best practice on the MIME spec, or if someone was
> even willing to roll their own parsing library, I'd welcome the
> contribution.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
