On Sat, 6 Aug 2011, Florian Müller wrote:
I would like to check if the content type provided by a content repository actually matches the content. So I use Tika to detect the type from the content stream and compare it to the provided content type. That works well if the content types are the same or one is a subtype of the other. But there are some cases that require a more fuzzy comparison. If, for example, Tika detects "application/xhtml+xml" and the repository reports "text/html" then that would be a close enough match for my purpose.
Hmm, I'd say that we should probably have a common parent type for those two.
My feeling is that if two content types are similar, as in the HTML 4, HTML 5 and XHTML cases, then we should probably reflect that in the types hierarchy.
That's certainly what we do for other formats. Take .docx for example: it's docx -> tika ooxml -> zip, so docx and xlsx are siblings, and both are distant relatives of other zip based office formats like keynote and odf.
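
To make the sibling / distant relative idea concrete, here is a rough sketch of a "closest common ancestor" check over a parent map. It is only an illustration: the class name, the method names and the hand-written parent table are all assumptions made up for this example, and nothing here is read from Tika's real registry. With a shared parent in place, a fuzzy match could be as simple as "do the two types share an ancestor that is specific enough".

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a tiny hand-written type hierarchy, NOT Tika's registry.
class TypeHierarchySketch {
    static final String DOCX =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    static final String XLSX =
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
    static final String ODT = "application/vnd.oasis.opendocument.text";
    static final String OOXML = "application/x-tika-ooxml";
    static final String ZIP = "application/zip";

    // child -> parent links, assumed here purely for illustration
    private static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put(DOCX, OOXML);
        PARENT.put(XLSX, OOXML);
        PARENT.put(OOXML, ZIP);
        PARENT.put(ODT, ZIP);
    }

    // The type itself followed by its ancestors, nearest first.
    static Set<String> lineage(String type) {
        Set<String> chain = new LinkedHashSet<>();
        // chain.add returns false on a repeat, which also guards against cycles
        for (String t = type; t != null && chain.add(t); t = PARENT.get(t)) {
            // walking up the parent links
        }
        return chain;
    }

    // Nearest type that appears in both lineages, or null if there is none.
    static String closestCommonAncestor(String a, String b) {
        Set<String> lineageOfA = lineage(a);
        for (String t : lineage(b)) {
            if (lineageOfA.contains(t)) {
                return t;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(closestCommonAncestor(DOCX, XLSX));
        System.out.println(closestCommonAncestor(DOCX, ODT));
    }
}
```

In this toy table docx and xlsx meet at the ooxml type (siblings), while docx and odt only meet at zip (distant relatives). How far up the tree a match may sit before it stops being "close enough" is a policy decision for the caller.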
Nick
