Hi Nick,

Thanks for the response. I have already created an issue, TIKA-1292, that has:
- Ref to the JAR
- Small GitHub project with a reproduction case
- PR with a trivial priority patch that fixes this specific case
- PR with a bigger patch that tries to fix conflicting priorities

Thanks,
~t~ (mobile)

On May 19, 2014 10:59 PM, "Nick Burch" <[email protected]> wrote:

> On Wed, 7 May 2014, Tamás Cservenák wrote:
>
>> https://issues.apache.org/jira/browse/TIKA-1292
>>
>> In short: it's about Tika Detector detecting a JAR file (correct ZIP file,
>> with proper magic bytes, etc) as "text/html" instead of expected
>> "application/java-archive".
>
> Any chance you could find a jar file that shows the problem, and upload it
> to the bug?
>
> We probably just need to tweak the relative priorities of some matches, or
> something like that, but we need a sample file to check
>
>> Isn't MIME magic detection based on bundled tika-mimetypes.xml, where even
>> the globs defined for text/html (*.htm and *.html) does not match for the
>> JAR file above (*.jar), still, Tika selects the HTML mime type....
>
> The container detectors generally only specialise the type, after the mime
> magic and filename matches have kicked things off. Mime magic will beat
> filename matches, since files can (and often are) renamed, while a specific
> match off a container detector will beat those two (assuming the container
> detector decides to check it)
>
> Nick
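
For illustration only, the ordering Nick describes (mime magic beats filename globs, and a container check then specialises the result) can be sketched as a toy model. This is not Tika's code: the `MAGIC` and `GLOBS` tables, the function names, and the "manifest means JAR" rule are all assumptions made up for this sketch, and it ignores how Tika weighs priorities between competing magic matches, which is what TIKA-1292 is actually about.

```python
import io
import zipfile
from fnmatch import fnmatch

# Hypothetical, heavily simplified rule tables. In Tika the real rules
# live in tika-mimetypes.xml; these entries are assumptions for the sketch.
MAGIC = [(b"PK\x03\x04", "application/zip")]
GLOBS = [
    ("*.jar", "application/java-archive"),
    ("*.html", "text/html"),
    ("*.htm", "text/html"),
]


def detect_by_magic(data):
    """Return the type of the first magic prefix that matches, else None."""
    for prefix, mime in MAGIC:
        if data.startswith(prefix):
            return mime
    return None


def detect_by_glob(name):
    """Return the type of the first filename glob that matches, else None."""
    for pattern, mime in GLOBS:
        if fnmatch(name.lower(), pattern):
            return mime
    return None


def specialise_container(data, mime):
    """Refine a generic ZIP to a JAR if it carries a manifest (assumed rule)."""
    if mime != "application/zip":
        return mime
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            if "META-INF/MANIFEST.MF" in zf.namelist():
                return "application/java-archive"
    except zipfile.BadZipFile:
        pass
    return mime


def detect(data, name):
    """Magic first, filename glob as fallback, then container specialisation."""
    mime = detect_by_magic(data) or detect_by_glob(name) or "application/octet-stream"
    return specialise_container(data, mime)


def make_jar():
    """Build a minimal in-memory ZIP that contains only a manifest entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
    return buf.getvalue()


if __name__ == "__main__":
    # A JAR renamed to .html: the magic bytes win over the misleading name.
    print(detect(make_jar(), "renamed.html"))
    # No magic match, so the filename glob decides.
    print(detect(b"<html></html>", "page.html"))
```

In this model a JAR renamed to `renamed.html` still comes out as `application/java-archive`, which is the behaviour expected in the bug report; the reported `text/html` result would correspond to some other match taking precedence over the ZIP magic.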
