[ 
https://issues.apache.org/jira/browse/TIKA-225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremias Maerki updated TIKA-225:
---------------------------------

    Attachment: detection-bugfixes.diff

> [PATCH] Various bugfixes for MIME detection
> -------------------------------------------
>
>                 Key: TIKA-225
>                 URL: https://issues.apache.org/jira/browse/TIKA-225
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 0.4
>            Reporter: Jeremias Maerki
>             Fix For: 0.4
>
>         Attachments: detection-bugfixes.diff, test-files.zip
>
>
> Here's a patch that solves the following issues:
> - text/plain's priority is too high. XML files can start with the same BOMs, 
> so text/plain must not match before the XML types have been checked.
> - *.xsl, *.xslt and *.xsd are XML files, not text/plain. 
> XSLT has its own MIME type.
> - Consolidated the two XHTML entries.
> - Fixed a bug in the existing XML magics which caused plain XML files to be 
> detected as text/plain.
> - Added magics for UTF-16 encoding. (Some magics are still missing: 
> http://www.w3.org/TR/xml/#sec-guessing)
> - Added an entry for XSLT.
> - XML namespace detection didn't work if namespace prefixes were used 
> (examples: XSLT stylesheets or SVG graphics). Corrected this by adding an 
> additional detection step that fires up an XML parser to determine the root 
> element. Of course, this could probably be done without an XML parser but I 
> had limited time available.
> - Added a test case for some files (test files in separate ZIP, to be placed 
> under tika-core\src\test\resources\org\apache\tika\mime)
> HTH
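
To illustrate the BOM and UTF-16 points in the list above, here is a minimal
sketch of why a BOM alone cannot decide between text/plain and XML. It is not
the code from the attached patch and it does not use Tika's magic definitions;
the class name BomAwareSniffer and the method sniff are hypothetical. The byte
patterns follow the encoding autodetection table in the XML 1.0 specification.
The sketch only recognises documents that start with an XML declaration and
it ignores UTF-32.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: decides between XML and plain text
// only after looking past a possible byte order mark.
final class BomAwareSniffer {

    static String sniff(byte[] head) {
        Charset charset = StandardCharsets.UTF_8;
        int offset = 0;
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            offset = 3;                              // UTF-8 BOM
        } else if (startsWith(head, 0xFE, 0xFF)) {
            charset = StandardCharsets.UTF_16BE;     // UTF-16 big-endian BOM
            offset = 2;
        } else if (startsWith(head, 0xFF, 0xFE)) {
            charset = StandardCharsets.UTF_16LE;     // UTF-16 little-endian BOM
            offset = 2;
        } else if (startsWith(head, 0x00, 0x3C, 0x00, 0x3F)) {
            charset = StandardCharsets.UTF_16BE;     // "<?" without a BOM
        } else if (startsWith(head, 0x3C, 0x00, 0x3F, 0x00)) {
            charset = StandardCharsets.UTF_16LE;     // "<?" without a BOM
        }
        String text = new String(head, offset, head.length - offset, charset);
        // The XML check runs first; text/plain is only the fallback.
        return text.startsWith("<?xml") ? "application/xml" : "text/plain";
    }

    // Compares the leading bytes of data against the given unsigned values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The ordering is the point of the sketch: the BOM selects a decoding, the XML
check runs next, and text/plain is returned only when nothing more specific
matched.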

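The extra detection step described in the list above (running an XML parser to
find the root element) could look roughly like the following sketch. It is an
assumption about the approach, not the code from detection-bugfixes.diff: it
uses the JDK's StAX API, and the class name RootElementSniffer is hypothetical.
A namespace-aware parser resolves prefixes, so a root of xsl:stylesheet or
svg:svg is reported by its namespace URI and local name rather than by the
prefix the author happened to choose.

```java
import java.io.InputStream;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Tika: reads just far enough to see the
// document's root element.
final class RootElementSniffer {

    // Returns the qualified name of the root element, or null when the
    // stream is not well-formed XML up to that point.
    static QName rootElement(InputStream in) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        // Do not process DTDs or fetch external entities while sniffing.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    // The first START_ELEMENT event is the root element.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getName();
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Not XML, or broken before the root element: fall through.
        }
        return null;
    }
}
```

A caller would then map the returned namespace URI to a MIME type, for example
treating http://www.w3.org/1999/XSL/Transform as XSLT, and fall back to a
generic XML type when the namespace is unknown.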