Hi Nick,
Thanks for the response.

I already created an issue, TIKA-1292, which has:
- A reference to the JAR
- A small GitHub project with a reproduction case
- A PR with a trivial priority patch that fixes this specific case
- A PR with a bigger patch that tries to fix conflicting priorities

thanks,
~t~ (mobile)
On May 19, 2014 10:59 PM, "Nick Burch" <[email protected]> wrote:

> On Wed, 7 May 2014, Tamás Cservenák wrote:
>
>> https://issues.apache.org/jira/browse/TIKA-1292
>>
>> In short: it's about Tika Detector detecting a JAR file (correct ZIP file,
>> with proper magic bytes, etc) as "text/html" instead of expected
>> "application/java-archive".
>>
>
> Any chance you could find a jar file that shows the problem, and upload it
> to the bug?
>
> We probably just need to tweak the relative priorities of some matches, or
> something like that, but we need a sample file to check
>
>> Isn't MIME magic detection based on the bundled tika-mimetypes.xml, where even
>> the globs defined for text/html (*.htm and *.html) do not match the
>> JAR file above (*.jar)? Still, Tika selects the HTML mime type....
>>
>
> The container detectors generally only specialise the type, after the mime
> magic and filename matches have kicked things off. Mime magic will beat
> filename matches, since files can be (and often are) renamed, while a specific
> match off a container detector will beat those two (assuming the container
> detector decides to check it)
>
> Nick
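
To make sure I read the precedence right, here is a minimal sketch of the
ordering described above. It is not Tika's actual code: the class name, the
resolve() helper and the octet-stream fallback are all made up for
illustration, and null just means "no match from that source".

```java
// Hypothetical model of the precedence described in the thread; NOT Tika's code.
class DetectionOrderSketch {

    // Each argument is the type suggested by one source, or null for "no match".
    static String resolve(String magic, String glob, String container) {
        // Assumed fallback when nothing matches at all.
        String base = "application/octet-stream";
        // The filename glob is the weakest hint, since files can be renamed.
        if (glob != null) {
            base = glob;
        }
        // A mime magic match beats the filename match.
        if (magic != null) {
            base = magic;
        }
        // A specific container-detector match beats both, if it ran at all.
        return container != null ? container : base;
    }

    public static void main(String[] args) {
        // Magic says HTML, the *.jar glob says JAR, no container result.
        System.out.println(resolve("text/html", "application/java-archive", null));
        // Same hints, but a container detector recognised the archive.
        System.out.println(resolve("text/html", "application/java-archive",
                "application/java-archive"));
    }
}
```

If that ordering is right, a stray HTML magic match would outrank the *.jar
glob whenever the container detector does not get to specialise the type,
which matches the symptom in the issue.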
