Luís Filipe Nassif created TIKA-3771:
----------------------------------------

             Summary: Regression from TIKA-3687: Files wrongly detected as EML 
                 Key: TIKA-3771
                 URL: https://issues.apache.org/jira/browse/TIKA-3771
             Project: Tika
          Issue Type: Bug
    Affects Versions: 2.4.0
            Reporter: Luís Filipe Nassif
         Attachments: BEA498353ECFA1C440365BB434BBC228269917D7.png

While running regression tests as part of upgrading from Tika 1.x to 2.4.0, I 
found that several hundred samples of various file types are now detected as 
EML. The cause is the <match value="\nX-" type="string" offset="0:1024"/> rule 
that TIKA-3687 added to the minShouldMatch="2" clause. Attached is a sample PNG 
file that triggers this: besides \nX-, it also contains a \nDate: value in the 
first 1024 bytes.
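
To illustrate why two such short patterns are easy to hit in binary data, here 
is a minimal Java sketch of a minShouldMatch="2" check. It is a simplified 
stand-in, not Tika's actual magic-matching code: the class and method names are 
hypothetical, the clause is reduced to the two patterns mentioned above, and the 
input is a synthetic byte sequence rather than the attached PNG.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class MinShouldMatchSketch {

    // True if `pattern` starts at some offset in [from, to] of `data`.
    static boolean matchesInRange(byte[] data, byte[] pattern, int from, int to) {
        int last = Math.min(to, data.length - pattern.length);
        for (int i = from; i <= last; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return true;
            }
        }
        return false;
    }

    // Counts how many patterns occur at an offset in 0:1024 (assumed semantics).
    static int countMatches(byte[] data, List<String> patterns) {
        int hits = 0;
        for (String p : patterns) {
            byte[] raw = p.getBytes(StandardCharsets.ISO_8859_1);
            if (matchesInRange(data, raw, 0, 1024)) {
                hits++;
            }
        }
        return hits;
    }

    // Synthetic stand-in for a PNG: the 8-byte signature plus filler that
    // happens to contain both byte sequences. Not the attached sample.
    static byte[] syntheticPng() {
        String s = "\u0089PNG\r\n\u001a\n" + "....\nX-....\nDate:....";
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Only the two patterns named in this report; the real clause has more.
        List<String> clause = List.of("\nX-", "\nDate:");
        int hits = countMatches(syntheticPng(), clause);
        // minShouldMatch="2": two hits are enough to classify as message/rfc822.
        System.out.println("hits=" + hits + " detectedAsEml=" + (hits >= 2));
    }
}
```

With this synthetic input the sketch prints hits=2 detectedAsEml=true, even 
though the data starts with a PNG signature. A three-byte pattern such as \nX- 
carries very little evidence on its own.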

On an unrelated note, I tried to override the message/rfc822 MIME definition 
with a custom-tika-mimetypes.xml on the classpath, but it had no effect, 
although this used to work in Tika 1.x. Was that change intentional? I think 
user definitions should take precedence over Tika's built-in definitions, since 
they can depend on the domain or context (e.g. the same extension may be used 
by different applications).


