Hi,

On Mon, Dec 7, 2009 at 3:47 AM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> Currently the tika-config.xml file maps three mime-types to the HtmlParser:
>
>   <parser name="parse-html"
>           class="org.apache.tika.parser.html.HtmlParser">
>     <mime>text/html</mime>
>     <mime>application/xhtml+xml</mime>
>     <mime>application/x-asp</mime>
>   </parser>
>
> I notice that facebook.com, if you don't specify an Accept: value in the
> request header, returns this for the mime-type:
>
>   application/vnd.wap.xhtml+xml
>
> Wondering if this should be added to the set, and if so then what other
> variants like this are floating around.

Sounds good to me. For now we can add more types as we encounter them.

> Or if we need something like "application/*.xhtml.xml" so that wildcards can
> be used in mimetype patterns.

Ideally I'd like to see the media type registry be smart enough to
resolve such type relationships and the CompositeParser class improved
to take advantage of that when choosing the best parser for an
incoming document.

BR,

Jukka Zitting
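
PS. For illustration only, the "resolve type relationships" idea above
could be sketched roughly as follows. This is plain Java, not existing
Tika code: the SupertypeParserLookup class and its methods are
hypothetical, and treating application/vnd.wap.xhtml+xml as a subtype
of application/xhtml+xml is an assumption made for the example. Parser
selection walks from the reported type up through its declared
supertypes until a configured parser is found.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these names are Tika APIs.
class SupertypeParserLookup {
    // media type -> its declared supertype (assumed to be registered by hand)
    private final Map<String, String> supertypes = new HashMap<>();
    // media type -> name of the parser class configured for it
    private final Map<String, String> parsers = new HashMap<>();

    void addSupertype(String type, String supertype) {
        supertypes.put(type, supertype);
    }

    void addParser(String type, String parserClass) {
        parsers.put(type, parserClass);
    }

    // Returns the parser for the type itself or, failing that, for the
    // nearest supertype that has one configured.
    Optional<String> findParser(String type) {
        String current = type;
        // The step bound stops the walk if the supertype map contains a cycle.
        for (int steps = 0; current != null && steps <= supertypes.size(); steps++) {
            String parser = parsers.get(current);
            if (parser != null) {
                return Optional.of(parser);
            }
            current = supertypes.get(current);
        }
        return Optional.empty();
    }
}
```

With something like this, only application/xhtml+xml would need to be
listed for the HtmlParser in the config, and the WAP variant would be
picked up once its supertype relationship is registered. A wildcard
pattern would cover unknown variants without registration, at the cost
of matching types nobody has looked at.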