I'm running into problems with mimetype detection again.

I have a file named foo.xml. It should be detected as application/xml. The problem is that "<title>the title</title>" appears within the first 64 bytes of the file. Because of this, Tika (with the 0.4 snapshot tika-mimetypes.xml) detects it as text/html, which is wrong. Changing the magic priority of text/html to be either higher or lower than that of application/xml has no effect. The magic takes precedence over the glob pattern every time.

The easiest thing to do is just to edit tika-mimetypes.xml and remove the offending rule, which does work. But it makes me wonder: is there a way to tell Tika to match on the glob first and then on the magic, instead of magic then glob?
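
To make the question concrete, here is a rough sketch of the two orderings. It is not Tika code and does not use Tika's API: the rule tables and the function names (detect_by_glob, detect_by_magic, detect) are made up for illustration, and the magic check is simplified to a substring search in the first 64 bytes.

```python
import fnmatch

# Hypothetical rule tables for illustration only; these are not
# Tika's real data structures or the contents of tika-mimetypes.xml.
GLOB_RULES = [
    ("*.xml", "application/xml"),
    ("*.html", "text/html"),
]
MAGIC_RULES = [
    (b"<?xml", "application/xml"),
    (b"<title>", "text/html"),
    (b"<html", "text/html"),
]


def detect_by_glob(filename):
    """Return the type of the first glob pattern matching the file name."""
    for pattern, mime in GLOB_RULES:
        if fnmatch.fnmatch(filename.lower(), pattern):
            return mime
    return None


def detect_by_magic(head):
    """Return the type of the first marker found in the first 64 bytes."""
    window = head[:64].lower()
    for marker, mime in MAGIC_RULES:
        if marker in window:
            return mime
    return None


def detect(filename, head, glob_first=True):
    """Combine both detectors; glob_first picks which result wins."""
    by_glob = detect_by_glob(filename)
    by_magic = detect_by_magic(head)
    if glob_first:
        first, second = by_glob, by_magic
    else:
        first, second = by_magic, by_glob
    return first or second or "application/octet-stream"


if __name__ == "__main__":
    head = b"<doc><title>the title</title></doc>"
    print(detect("foo.xml", head, glob_first=True))
    print(detect("foo.xml", head, glob_first=False))
```

With glob_first=False the sketch reproduces the behaviour described above (the "<title>" marker wins); with glob_first=True the .xml extension wins, which is the ordering I am asking about.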

--
Jonathan Koren
jonat...@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/

