Re: Tika for mediawiki ?

Jan Høydahl / Cominvent Mon, 25 Oct 2010 01:15:01 -0700

Here's an Apache 2.0 licensed parser:
http://code.google.com/p/wikimodel/source/browse/trunk/org.wikimodel.wem/src/main/java/org/wikimodel/wem/mediawiki/


But I also think there's probably no need to involve Tika, unless you have a 
file system with tons of different files, some of which are plain-text wiki 
files. It is hard to detect the type of a MediaWiki file, as I don't think they 
have a standard filename suffix or magic byte sequence. You'd need to scan the 
content for characteristic parts of the markup.
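
The markup scanning mentioned above could look roughly like the sketch below. 
It is only a heuristic: the pattern list and the min_hits threshold are my own 
assumptions, not an official MediaWiki signature, and other formats that use 
{{...}} or [[...]] may produce false positives.

```python
import re

# Illustrative patterns only (an assumption, not an official signature list).
WIKI_PATTERNS = [
    re.compile(r"\[\[[^\[\]\n]+\]\]"),                             # [[Page]] or [[Page|label]]
    re.compile(r"\{\{[^{}\n]+\}\}"),                               # {{template}}
    re.compile(r"^={2,6}[^=\n].*?={2,6}[ \t]*$", re.MULTILINE),    # == Heading ==
    re.compile(r"'''[^'\n]+'''"),                                  # '''bold'''
    re.compile(r"<ref[ >/]"),                                      # <ref> footnote
    re.compile(r"^#REDIRECT", re.IGNORECASE | re.MULTILINE),       # redirect page
]


def looks_like_mediawiki(text, min_hits=2):
    """Return True if at least min_hits distinct patterns occur in text."""
    hits = sum(1 for pattern in WIKI_PATTERNS if pattern.search(text))
    return hits >= min_hits
```

Requiring two distinct patterns rather than one is a guess at a reasonable 
trade-off; tune it against your own files.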

If I were you, I'd build a standalone program that interfaces with your wiki 
(if it is Wikipedia, perhaps download a dump from 
http://en.wikipedia.org/wiki/Wikipedia:Database_download), parses the pages, 
and feeds them to your index or whatever you need.
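
As a rough sketch of such a standalone program, the snippet below streams 
(title, wikitext) pairs out of a MediaWiki XML export. The function names and 
the inline sample are made up for illustration, and the namespace version in 
the sample is just one that exists; real dumps may use another, which is why 
the code ignores the namespace entirely.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on element tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the page's children to limit memory use


# Tiny hand-written sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Red fox</title><revision><text>The '''red fox''' is a [[mammal]].</text></revision></page>
<page><title>Oak</title><revision><text>An [[oak]] is a tree.</text></revision></page>
</mediawiki>"""

for page_title, wikitext in iter_pages(io.BytesIO(SAMPLE)):
    print(page_title, len(wikitext))
```

For a real dump you would pass an open file instead of the sample, for example 
one opened with bz2.open(path, "rb") if the dump is bzip2-compressed, and 
replace the print call with whatever feeds your index.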

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 25. okt. 2010, at 09.27, Ista Pouss wrote:

> Hi,
> 
> I thank Mattmann Chris and Paul Jakubik.
> 
> There is no official spec of the markup language.  There are some
> parsers... I found "Wiki2HtmlJavaProgram"
> (http://community.jboss.org/wiki/Wiki2HtmlJavaProgram) and "jwpl"
> (http://code.google.com/p/jwpl/). Perhaps it's best to start from
> scratch with ANTLR?
> 
> My use case is to fetch and analyse Wikipedia pages on natural life themes.
> 
> I think there is no MIME type for a Wikipedia page's source. Mr.
> Mattmann says if "someone... wants to throw out there a best practice
> on the MIME spec"... what is that?
> 
> Perhaps I'm going to start My Little Tika Project on that, but I'm
> afraid I can't do all the mess and stuff and foo about mediawiki
> markup...
> 
> Thanks.
> 
> 
> 2010/10/24 Mattmann, Chris A (388J) <[email protected]>:
>> Hi Guys,
>> 
>>> [...]
>>> Until there is a complete spec for parsing media wiki markup, or a java
>>> library that does a good job of extracting text from documents formatted
>>> with media wiki markup, I don't think extracting text from media wiki
>>> markup documents is in scope for Tika.
>> 
>> I'd disagree with that. We never have complete specs for *many* of the
>> existing formats we tackle in Tika, and there are exceptions and bugs and
>> platform-specific things that are found all the time that require
>> accommodations.
>> 
>> I'd say if someone can find a parsing library for Media-wiki format, and
>> wants to throw out there a best practice on the MIME spec, or if someone was
>> even willing to roll their own parsing library, I'd welcome the
>> contribution.
>> 
>> Cheers,
>> Chris
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: [email protected]
>> WWW:   http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 
>> 
>>
