[ https://issues.apache.org/jira/browse/TIKA-257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-257.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.4
         Assignee: Jukka Zitting

I found a pretty accurate magic byte pattern (the file name string [Content_Types].xml at offset 30) for OOXML files. This still doesn't tell whether the document is a spreadsheet, a presentation or something different, but at least it's enough to allow Tika to correctly send the document to OOXMLParser for more detailed processing with POI.

I added the byte pattern and made some related adjustments in revision 793696. The above test case now passes. Resolving as Fixed.

> Uncorrect mime-type detection for ooxml
> ---------------------------------------
>
>                 Key: TIKA-257
>                 URL: https://issues.apache.org/jira/browse/TIKA-257
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 0.4
>            Reporter: Maxim Valyanskiy
>            Assignee: Jukka Zitting
>             Fix For: 0.4
>
>
> MimeTypes detects docx (and other office XML documents) as 'application/zip' when file does not have proper extension:
> $ java -jar tika-app/target/tika-app-0.4-SNAPSHOT.jar -m /home/maxcom/download-tmp/proto.docx
> Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
> resourceName: proto.docx
> $ cat /home/maxcom/download-tmp/proto.docx | java -jar tika-app/target/tika-app-0.4-SNAPSHOT.jar -m
> Content-Type: application/zip
> This breaks text extraction when filename is not known

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
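
For illustration, the magic byte check described in the resolution comment could be sketched roughly as follows. This is not the code from revision 793696. The class name OoxmlMagicSketch, its methods, and the extra check of the ZIP signature are assumptions made for this example. The sketch relies on the fact that a ZIP local file header is 30 bytes long, so the name of the first entry in the archive starts at offset 30.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Tika's actual implementation.
final class OoxmlMagicSketch {
    // A ZIP local file header is 30 bytes long, so the first entry's
    // file name begins at offset 30.
    static final int NAME_OFFSET = 30;
    // "PK\003\004", the ZIP local file header signature (checking it is an
    // assumption of this sketch; the comment above only mentions the name).
    static final byte[] ZIP_SIGNATURE = {'P', 'K', 3, 4};
    static final byte[] OOXML_NAME =
            "[Content_Types].xml".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the leading bytes of a stream look like an OOXML package. */
    static boolean looksLikeOoxml(byte[] head) {
        if (head.length < NAME_OFFSET + OOXML_NAME.length) {
            return false; // not enough bytes to contain the pattern
        }
        return regionEquals(head, 0, ZIP_SIGNATURE)
                && regionEquals(head, NAME_OFFSET, OOXML_NAME);
    }

    // Compares expected against data starting at offset; caller checks bounds.
    private static boolean regionEquals(byte[] data, int offset, byte[] expected) {
        for (int i = 0; i < expected.length; i++) {
            if (data[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Builds test data only: a zero-filled 30-byte header carrying the ZIP
     * signature, followed by the given entry name. Not a valid ZIP archive.
     */
    static byte[] fakeZipHead(String firstEntryName) {
        byte[] name = firstEntryName.getBytes(StandardCharsets.US_ASCII);
        byte[] head = new byte[NAME_OFFSET + name.length];
        System.arraycopy(ZIP_SIGNATURE, 0, head, 0, ZIP_SIGNATURE.length);
        System.arraycopy(name, 0, head, NAME_OFFSET, name.length);
        return head;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeOoxml(fakeZipHead("[Content_Types].xml")));
        System.out.println(looksLikeOoxml(fakeZipHead("some-other-entry.txt")));
    }
}
```

A check of this kind needs only the first bytes of the content, which is why it helps in the piped case from the report, where no file name is available. As the comment notes, it cannot distinguish a word processing document from a spreadsheet or a presentation.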