On Wed, 7 May 2014, Tamás Cservenák wrote:
https://issues.apache.org/jira/browse/TIKA-1292

In short: it's about the Tika Detector detecting a JAR file (a correct ZIP
file, with proper magic bytes, etc.) as "text/html" instead of the expected
"application/java-archive".

Any chance you could find a jar file that shows the problem, and upload it to the bug?

We probably just need to tweak the relative priorities of some matches, or something like that, but we need a sample file to check.

Isn't MIME magic detection based on the bundled tika-mimetypes.xml? Even
the globs defined there for text/html (*.htm and *.html) do not match the
JAR file above (*.jar), yet Tika still selects the HTML mime type....

The container detectors generally only specialise the type, after the mime magic and filename matches have kicked things off. Mime magic will beat filename matches, since files can be (and often are) renamed, while a specific match from a container detector will beat those two (assuming the container detector decides to check it).
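
As a rough illustration of that ordering, here is a toy model in Python. It is not Tika's code: the function names, the lookup tables, the naive HTML sniffing and the manifest check are all assumptions made up for this sketch, and it ignores details such as match priorities and the type hierarchy. It only shows the precedence described above: magic first, filename as a fallback, and a container check that can specialise the result.

```python
from typing import List, Optional

# Signature found at the start of a ZIP (and therefore JAR) file.
ZIP_MAGIC = b"PK\x03\x04"

# Hypothetical, heavily simplified glob table; the real rules live in
# tika-mimetypes.xml and are far richer than this.
GLOBS = {
    ".jar": "application/java-archive",
    ".zip": "application/zip",
    ".html": "text/html",
    ".htm": "text/html",
}


def detect_by_magic(head: bytes) -> Optional[str]:
    """Guess a type from the leading bytes only (assumed, simplified rules)."""
    if head.startswith(ZIP_MAGIC):
        return "application/zip"
    if b"<html" in head.lower():
        return "text/html"
    return None


def detect_by_name(name: str) -> Optional[str]:
    """Guess a type from the file extension only."""
    for suffix, mime in GLOBS.items():
        if name.lower().endswith(suffix):
            return mime
    return None


def specialise_container(base: str, entries: List[str]) -> Optional[str]:
    """Refine a generic container type by looking inside it.

    Assumption for this sketch: a ZIP containing a manifest is a JAR.
    """
    if base == "application/zip" and "META-INF/MANIFEST.MF" in entries:
        return "application/java-archive"
    return None


def detect(head: bytes, name: str, entries: List[str]) -> str:
    # Magic beats the filename, because files can be renamed.
    base = detect_by_magic(head) or detect_by_name(name) or "application/octet-stream"
    # A specific container match beats both of the above.
    return specialise_container(base, entries) or base
```

In this model, a file whose leading bytes trip an HTML rule is reported as text/html regardless of its .jar name, and the container check never gets a chance to run. That is one way a misdetection like the one in the bug could arise, though without a sample file it is only a guess.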

Nick
