Hi Tika experts,

Question: How do I enable multiple parsers for specific MIME types?

I am using Tika to parse HTML pages.

My requirement is that both *NamedEntityParser* and *HtmlParser* have to be
enabled for specific web-related MIME types like *text/html* and
*application/xhtml+xml*.

From my findings on the Tika wiki, this should be possible with CompositeParser,
but I am not getting it right. Only the last parser registered for a MIME
type seems to take effect.

My configuration is given below.

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
        </parser>

        <parser class="org.apache.tika.parser.ner.NamedEntityParser">
            <mime>text/plain</mime>
            <mime>text/html</mime>
            <mime>text/x-php</mime>
            <mime>text/x-jsp</mime>
            <mime>application/atom+xml</mime>
            <mime>application/xhtml+xml</mime>
            <mime>application/xml</mime>
            <mime>application/rss+xml</mime>
            <mime>application/pdf</mime>
            <mime>application/msword</mime>
            <mime>text/asp</mime>
        </parser>

        <parser class="org.apache.tika.parser.html.HtmlParser">
            <mime>text/html</mime>
            <mime>text/x-php</mime>
            <mime>text/x-jsp</mime>
            <mime>application/atom+xml</mime>
            <mime>application/xhtml+xml</mime>
            <mime>application/xml</mime>
            <mime>application/rss+xml</mime>
            <mime>text/asp</mime>
        </parser>
    </parsers>
</properties>
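
To make the overlap in a configuration like the one above easier to see, here is a minimal sketch that lists every MIME type claimed by more than one <parser> element. It uses only the JDK's XML APIs and does not call Tika at all; the class name MimeOverlapCheck and the method findOverlaps are hypothetical names made up for this illustration. It assumes a flat config in which <parser> elements are not nested inside one another.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of Tika): reports MIME types that are
// listed under more than one <parser> element of a tika-config style file.
final class MimeOverlapCheck {

    // Returns MIME type -> parser classes that list it, for types listed by 2+ parsers.
    static Map<String, List<String>> findOverlaps(String configXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(configXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse config XML", e);
        }

        // MIME type -> distinct parser classes, in document order.
        Map<String, Set<String>> claims = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            String parserClass = parser.getAttribute("class");
            NodeList mimes = parser.getElementsByTagName("mime");
            for (int j = 0; j < mimes.getLength(); j++) {
                String mime = mimes.item(j).getTextContent().trim();
                claims.computeIfAbsent(mime, k -> new LinkedHashSet<>()).add(parserClass);
            }
        }

        // Keep only the MIME types that more than one parser class lists.
        Map<String, List<String>> overlaps = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : claims.entrySet()) {
            if (entry.getValue().size() > 1) {
                overlaps.put(entry.getKey(), new ArrayList<>(entry.getValue()));
            }
        }
        return overlaps;
    }

    public static void main(String[] args) {
        // Shortened version of the configuration above.
        String xml = "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.ner.NamedEntityParser\">"
                + "<mime>text/plain</mime><mime>text/html</mime>"
                + "<mime>application/xhtml+xml</mime></parser>"
                + "<parser class=\"org.apache.tika.parser.html.HtmlParser\">"
                + "<mime>text/html</mime><mime>application/xhtml+xml</mime></parser>"
                + "</parsers></properties>";
        findOverlaps(xml).forEach((mime, classes) ->
                System.out.println(mime + " -> " + classes));
    }
}
```

Every MIME type this prints is one where two parsers compete, which is where I see only the last registered parser being applied.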

Thanks in advance,
Thamme.

--
*Thamme Gowda N.*
Grad Student at usc.edu
Twitter: @thammegowda <https://twitter.com/thammegowda>
Website: http://scf.usc.edu/~tnarayan/
