[ https://issues.apache.org/jira/browse/TIKA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeremias Maerki updated TIKA-225:
---------------------------------

    Attachment: detection-bugfixes.diff

> [PATCH] Various bugfixes for MIME detection
> -------------------------------------------
>
>                 Key: TIKA-225
>                 URL: https://issues.apache.org/jira/browse/TIKA-225
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 0.4
>            Reporter: Jeremias Maerki
>             Fix For: 0.4
>
>         Attachments: detection-bugfixes.diff, test-files.zip
>
>
> Here's a patch that solves the following issues:
> - text/plain's priority is too high. The BOMs are also used by XML, so it
>   must be ensured that text/plain is not matched too soon.
> - *.xsl, *.xslt and *.xsd are not text/plain; they are actually XML files.
>   XSLT has its own MIME type.
> - Consolidated the two XHTML entries.
> - Fixed a bug in the existing XML magics which caused plain XML files to be
>   detected as text/plain.
> - Added magics for UTF-16 encoding. (Some magics are still missing:
>   http://www.w3.org/TR/xml/#sec-guessing)
> - Added an entry for XSLT.
> - XML namespace detection didn't work if namespace prefixes are used
>   (examples: XSLT stylesheets or SVG graphics). Corrected this by adding an
>   additional detection step that fires up an XML parser to determine the
>   root element. Of course, this could probably be done without an XML parser
>   but I had limited time available.
> - Added a test case for some files (test files in a separate ZIP, to be
>   placed under tika-core\src\test\resources\org\apache\tika\mime).
> HTH

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
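
For readers without the attachment at hand, the root-element detection step described in the issue could look roughly like the sketch below. It is not the code from detection-bugfixes.diff. The class name `RootElementSniffer`, its methods, and the small namespace-to-type table are assumptions made up for illustration, and Tika's own registry may map these namespaces differently. The sketch uses a namespace-aware SAX parser and stops at the first start tag, so a prefixed root such as `xsl:stylesheet` is resolved to its namespace URI.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper, not part of Tika: finds the root element of an XML document. */
public class RootElementSniffer {

    /** Thrown from the handler to abort parsing once the root element is known. */
    private static final class RootFound extends SAXException {
        final QName name;
        RootFound(QName name) {
            super("root element found");
            this.name = name;
        }
    }

    /** Returns the qualified name of the root element, or null if the data is not parseable XML. */
    public static QName sniff(byte[] data) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so prefixes are resolved to namespace URIs
        try {
            factory.newSAXParser().parse(new ByteArrayInputStream(data), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                                         Attributes atts) throws SAXException {
                    // The first start tag is the root element; stop right here.
                    throw new RootFound(new QName(uri, localName));
                }

                @Override
                public InputSource resolveEntity(String publicId, String systemId) {
                    // Substitute an empty entity so external DTDs are never fetched.
                    return new InputSource(new StringReader(""));
                }
            });
        } catch (RootFound found) {
            return found.name;
        } catch (SAXException | ParserConfigurationException | IOException e) {
            return null; // not well-formed up to the root element, or parser unavailable
        }
        return null;
    }

    /** Illustrative mapping only; the returned type names are an assumption of this sketch. */
    public static String mimeFor(QName root) {
        if (root == null) {
            return null;
        }
        switch (root.getNamespaceURI()) {
            case "http://www.w3.org/1999/XSL/Transform":
                return "application/xslt+xml";
            case "http://www.w3.org/2000/svg":
                return "image/svg+xml";
            case "http://www.w3.org/1999/xhtml":
                return "application/xhtml+xml";
            default:
                return "application/xml";
        }
    }
}
```

Aborting with an exception keeps the cost low, since only the prolog and the first start tag are read. As the issue notes, the same result could probably be obtained without a full XML parser.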