On Wed, 7 May 2014, Tamás Cservenák wrote:
> https://issues.apache.org/jira/browse/TIKA-1292
> In short: it's about the Tika Detector detecting a JAR file (a correct ZIP
> file, with proper magic bytes, etc.) as "text/html" instead of the expected
> "application/java-archive".
Any chance you could find a jar file that shows the problem, and upload it
to the bug?
We probably just need to tweak the relative priorities of some matches, or
something like that, but we need a sample file to check.
> Isn't MIME magic detection based on the bundled tika-mimetypes.xml, where
> even the globs defined for text/html (*.htm and *.html) do not match the
> JAR file above (*.jar)? Still, Tika selects the HTML mime type....
The container detectors generally only specialise the type, after the mime
magic and filename matches have kicked things off. Mime magic will beat
filename matches, since files can be (and often are) renamed, while a
specific match from a container detector will beat those two (assuming the
container detector decides to check it).
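
As a rough illustration of that ordering, here is a toy model in plain Java.
It is not Tika's code: the class DetectionPrecedenceSketch, its methods, the
tiny magic/glob tables and the "META-INF/MANIFEST.MF means JAR" rule are all
made up for the sketch. It only shows which step overrides which.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical toy model of the precedence described above; NOT Tika's API.
final class DetectionPrecedenceSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // Filename step (toy version: simple suffix checks instead of real globs).
    static String byName(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".jar")) return "application/java-archive";
        if (lower.endsWith(".zip")) return "application/zip";
        if (lower.endsWith(".html") || lower.endsWith(".htm")) return "text/html";
        return null;
    }

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Magic step (toy version: two hard-coded signatures).
    static String byMagic(byte[] head) {
        if (startsWith(head, new byte[] {'P', 'K', 3, 4})) return "application/zip";
        if (startsWith(head, "<html".getBytes(StandardCharsets.US_ASCII))) return "text/html";
        return null;
    }

    // Container step: only specialises an existing ZIP result.
    // The manifest check is an assumption made for this illustration.
    static String byContainer(String soFar, List<String> entryNames) {
        if ("application/zip".equals(soFar) && entryNames.contains("META-INF/MANIFEST.MF")) {
            return "application/java-archive";
        }
        return null;
    }

    // Later steps override earlier ones: name < magic < container.
    static String detect(byte[] head, String name, List<String> entryNames) {
        String result = OCTET_STREAM;
        String fromName = byName(name);
        if (fromName != null) result = fromName;
        String fromMagic = byMagic(head);
        if (fromMagic != null) result = fromMagic;
        String fromContainer = byContainer(result, entryNames);
        if (fromContainer != null) result = fromContainer;
        return result;
    }
}
```

In this sketch, a ZIP-magic file named "page.html" that contains a manifest
entry comes out as application/java-archive, and the same bytes with no
manifest come out as application/zip: the .html name never wins over the
magic match.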
Nick