Hi Tika experts,
Question: How do I enable multiple parsers for specific MIME types?
I am using Tika to parse HTML pages.
My requirement is that both *NamedEntityParser* and *HtmlParser* have to be
enabled for specific web-related MIME types such as *text/html* and
*application/xhtml+xml*.
From my findings on the Tika wiki, this should be possible with CompositeParser,
but I am not getting it right. Only the last parser registered for a MIME
type seems to take effect.
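To illustrate the symptom, here is a toy model in plain Java. It is not Tika's code. It only assumes (a guess on my part) that parsers are looked up in a map keyed by MIME type, so a later registration for the same type replaces an earlier one. The class and method names (LastRegistrationWins, register, buildRegistry) are made up for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only, NOT Tika's implementation. It assumes a registry keyed by
// MIME type, which would explain why only the last registered parser is used.
public class LastRegistrationWins {

    // Hypothetical helper: records parserName as the handler for each MIME type.
    // A put() on an existing key replaces the previously registered parser.
    static void register(Map<String, String> registry, String parserName, String... mimeTypes) {
        for (String mime : mimeTypes) {
            registry.put(mime, parserName);
        }
    }

    // Registers the two parsers in the same order as in my configuration
    // (only a few of the MIME types are repeated here to keep it short).
    static Map<String, String> buildRegistry() {
        Map<String, String> registry = new LinkedHashMap<>();
        register(registry, "NamedEntityParser", "text/plain", "text/html", "application/xhtml+xml");
        register(registry, "HtmlParser", "text/html", "application/xhtml+xml");
        return registry;
    }

    public static void main(String[] args) {
        // Print which single parser ends up mapped to each MIME type.
        for (Map.Entry<String, String> entry : buildRegistry().entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

In this model text/html and application/xhtml+xml end up mapped to HtmlParser only, while text/plain stays with NamedEntityParser, which matches what I observe. What I want instead is for both parsers to run on the shared types.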
My configuration is given below.
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
    </parser>
    <parser class="org.apache.tika.parser.ner.NamedEntityParser">
      <mime>text/plain</mime>
      <mime>text/html</mime>
      <mime>text/x-php</mime>
      <mime>text/x-jsp</mime>
      <mime>application/atom+xml</mime>
      <mime>application/xhtml+xml</mime>
      <mime>application/xml</mime>
      <mime>application/rss+xml</mime>
      <mime>application/pdf</mime>
      <mime>application/atom+xml</mime>
      <mime>application/msword</mime>
      <mime>text/asp</mime>
    </parser>
    <parser class="org.apache.tika.parser.html.HtmlParser">
      <mime>text/html</mime>
      <mime>text/x-php</mime>
      <mime>text/x-jsp</mime>
      <mime>application/atom+xml</mime>
      <mime>application/xhtml+xml</mime>
      <mime>application/xml</mime>
      <mime>application/rss+xml</mime>
      <mime>application/atom+xml</mime>
      <mime>text/asp</mime>
    </parser>
  </parsers>
</properties>
Thanks in advance,
Thamme
--
*Thamme Gowda N.*
Grad Student at usc.edu
Twitter: @thammegowda <https://twitter.com/thammegowda>
Website : http://scf.usc.edu/~tnarayan/