Hi,

On Mon, Dec 7, 2009 at 3:47 AM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> Currently the tika-config.xml file maps three mime-types to the HtmlParser:
>
>        <parser name="parse-html" class="org.apache.tika.parser.html.HtmlParser">
>                <mime>text/html</mime>
>                <mime>application/xhtml+xml</mime>
>                <mime>application/x-asp</mime>
>        </parser>
>
> I notice that facebook.com, if you don't specify an Accept: value in the
> request header, returns this for the mime-type:
>
> application/vnd.wap.xhtml+xml
>
> Wondering if this should be added to the set, and if so then what other
> variants like this are floating around.

Sounds good to me. For now we can add more types as we encounter them.

> Or if we need something like "application/*.xhtml.xml" so that wildcards can
> be used in mimetype patterns.

Ideally I'd like to see the media type registry be smart enough to
resolve such type relationships and the CompositeParser class improved
to take advantage of that when choosing the best parser for an
incoming document.
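
A minimal sketch of that idea follows. It is not existing Tika code: the
class MediaTypeFallback and its methods are hypothetical, and the rule it
uses (treat "type/prefix.name+suffix" as a specialization of
"type/name+suffix") is an assumption made for illustration. It only shows
how a lookup could fall back from application/vnd.wap.xhtml+xml to
application/xhtml+xml without listing every variant in the config.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of supertype-aware parser lookup; not part of Tika. */
public class MediaTypeFallback {

    // Media type -> parser name, mirroring the three <mime> entries
    // in the quoted tika-config.xml fragment.
    private final Map<String, String> parsers = new LinkedHashMap<>();

    public MediaTypeFallback() {
        parsers.put("text/html", "parse-html");
        parsers.put("application/xhtml+xml", "parse-html");
        parsers.put("application/x-asp", "parse-html");
    }

    /**
     * Assumed rule: "application/vnd.wap.xhtml+xml" specializes
     * "application/xhtml+xml". Keeps only the last dot-separated part of
     * the subtype and the "+suffix". Returns empty if the type has no
     * "+suffix" or no dotted prefix to strip.
     */
    public static Optional<String> supertype(String mediaType) {
        int slash = mediaType.indexOf('/');
        int plus = mediaType.lastIndexOf('+');
        if (slash < 0 || plus < slash) {
            return Optional.empty();
        }
        String subtype = mediaType.substring(slash + 1, plus);
        int dot = subtype.lastIndexOf('.');
        if (dot < 0) {
            return Optional.empty();
        }
        return Optional.of(mediaType.substring(0, slash + 1)
                + subtype.substring(dot + 1)
                + mediaType.substring(plus));
    }

    /** Looks up the exact type first, then walks up the assumed supertypes. */
    public Optional<String> findParser(String mediaType) {
        Optional<String> current = Optional.of(mediaType.toLowerCase(Locale.ROOT));
        while (current.isPresent()) {
            String parser = parsers.get(current.get());
            if (parser != null) {
                return Optional.of(parser);
            }
            current = supertype(current.get());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        MediaTypeFallback registry = new MediaTypeFallback();
        // The type facebook.com returns when no Accept header is sent.
        System.out.println(
                registry.findParser("application/vnd.wap.xhtml+xml").orElse("none"));
        // An unrelated type finds no parser.
        System.out.println(registry.findParser("application/pdf").orElse("none"));
    }
}
```

With the three configured types above, the first lookup resolves to
parse-html through the fallback and the second finds nothing. A real
registry would want explicitly declared type relationships rather than
string surgery on the subtype, since not every dotted prefix implies a
true specialization.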

BR,

Jukka Zitting
