Currently, the tika-config.xml file maps three mime-types to the HtmlParser:

<parser name="parse-html" class="org.apache.tika.parser.html.HtmlParser">
        <mime>text/html</mime>
        <mime>application/xhtml+xml</mime>
        <mime>application/x-asp</mime>
</parser>

I noticed that facebook.com returns the following mime-type when the request does not include an Accept: header:

application/vnd.wap.xhtml+xml

I'm wondering whether this should be added to the set and, if so, what other variants like it are floating around.

Alternatively, we may need something like "application/*.xhtml.xml", so that wildcards can be used in mime-type patterns.
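
To make the wildcard idea concrete, here is a minimal sketch of glob-style matching against a mime-type. It is not existing Tika behaviour: the class name MimeWildcard, its methods, and the pattern "application/*xhtml+xml" are all hypothetical, and the pattern syntax ('*' matches any run of characters, everything else is literal) is just one possible choice.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: glob-style matching of mime-types.
class MimeWildcard {

    // Turn a glob such as "application/*xhtml+xml" into a regex.
    // '*' becomes ".*"; every other character is quoted so that '+' and '.'
    // are matched literally.
    static Pattern toRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        int start = 0;
        for (int i = 0; i < glob.length(); i++) {
            if (glob.charAt(i) == '*') {
                regex.append(Pattern.quote(glob.substring(start, i))).append(".*");
                start = i + 1;
            }
        }
        regex.append(Pattern.quote(glob.substring(start)));
        return Pattern.compile(regex.toString());
    }

    // Compare a Content-Type value against the glob, ignoring case and
    // any parameters after ';' (for example "; charset=utf-8").
    static boolean matches(String glob, String contentType) {
        int semicolon = contentType.indexOf(';');
        String type = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        type = type.trim().toLowerCase(Locale.ROOT);
        return toRegex(glob.toLowerCase(Locale.ROOT)).matcher(type).matches();
    }

    public static void main(String[] args) {
        String glob = "application/*xhtml+xml"; // hypothetical pattern
        System.out.println(matches(glob, "application/xhtml+xml"));
        System.out.println(matches(glob, "application/vnd.wap.xhtml+xml"));
        System.out.println(matches(glob, "text/html"));
    }
}
```

With this particular pattern, both application/xhtml+xml and application/vnd.wap.xhtml+xml match, while text/html does not. The trade-off is that a wildcard also accepts variants nobody has looked at, whereas listing each type explicitly keeps the set predictable.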

-- Ken


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



