Currently the tika-config.xml file maps three mime-types to the
HtmlParser:

    <parser name="parse-html"
            class="org.apache.tika.parser.html.HtmlParser">
        <mime>text/html</mime>
        <mime>application/xhtml+xml</mime>
        <mime>application/x-asp</mime>
    </parser>

I noticed that facebook.com, when the request has no Accept: header,
returns this mime-type:
application/vnd.wap.xhtml+xml
I'm wondering whether this should be added to the set, and if so, what
other variants like this are floating around.
Or do we need something like "application/*.xhtml+xml", so that
wildcards can be used in mime-type patterns?
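
To make the wildcard idea concrete, here is a rough sketch in Java of how
a glob-style mime-type pattern could be matched. This is not existing Tika
code: the MimeGlob class and its matches() method are hypothetical names
made up for illustration, and the choice that "*" never crosses a "/" or
";" is my own assumption about what such a pattern should mean.

```java
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Tika: matches a mime-type against a
// glob-style pattern in which '*' stands for any run of characters that
// does not include '/', ';' or whitespace.
final class MimeGlob {
    private final Pattern regex;

    MimeGlob(String glob) {
        StringBuilder sb = new StringBuilder();
        int start = 0;
        for (int i = 0; i < glob.length(); i++) {
            if (glob.charAt(i) == '*') {
                // Quote the literal text before the wildcard so that
                // characters such as '+' and '.' are matched literally.
                sb.append(Pattern.quote(glob.substring(start, i)));
                sb.append("[^/;\\s]*");
                start = i + 1;
            }
        }
        sb.append(Pattern.quote(glob.substring(start)));
        // Mime-type names are case-insensitive.
        this.regex = Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
    }

    boolean matches(String contentType) {
        // Ignore parameters such as "; charset=UTF-8".
        int semi = contentType.indexOf(';');
        String base = (semi >= 0 ? contentType.substring(0, semi) : contentType).trim();
        return regex.matcher(base).matches();
    }
}
```

With this sketch, new MimeGlob("application/*.xhtml+xml") accepts
application/vnd.wap.xhtml+xml but rejects plain application/xhtml+xml,
because the pattern requires a "." before "xhtml". A looser pattern such
as "application/*xhtml+xml" would cover both.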
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g