Are you able to share some examples? If so, please open a ticket.

-----Original Message-----
From: Eli Trucco [mailto:theknight...@yahoo.com]
Sent: Wednesday, July 20, 2016 10:39 AM
To: user@tika.apache.org
Subject: Problems with email attachments

Hi guys,

I'm currently writing a small app that reads a directory and parses every document in it, including extracting any attachments or embedded files. I use Tika for this, but I have run into a couple of problems while parsing .eml files from Thunderbird. Some of them are wrongly identified (as text/html or application/xhtml+xml), and in many of them the attachments are not detected.

I tried to parse 20 random .eml files with attachments (pdf, txt, html, etc.), and at least 10 of them are either identified as HTML, or correctly identified as rfc822 but without their attachments being extracted. I tried the same files with the TikaCLI -z option, with the same result :(

What I did: I extended the class ParsingEmbeddedDocumentExtractor to extract and store the attachments somewhere else (exactly as shown in this example code: https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java).

Is this the correct approach, or is there another way to do this? Any ideas on how to improve the type detection or extract the attachments more reliably would be really appreciated!

Regards,
Eli Trucco
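
To narrow down which of the 20 files are worth attaching to a ticket, it may help to check the raw .eml text independently of Tika. The sketch below is only a rough triage aid that uses the Java standard library and no Tika API. The class name `EmlTriage` and both method names are made up for illustration. It reports whether a file begins with something that looks like a header field, and how many parts declare `Content-Disposition: attachment`. It is a line-based heuristic, so it will miss folded headers and attachments sent with an inline disposition.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Tika.
class EmlTriage {

    // A header field name is printable ASCII without spaces or ':', followed by ':'.
    private static final Pattern HEADER_FIELD =
            Pattern.compile("[\\x21-\\x39\\x3B-\\x7E]+:.*");

    /** True if the first non-blank line looks like "Name: value". */
    static boolean startsWithHeaderField(String raw) {
        for (String line : raw.split("\\r?\\n")) {
            if (line.trim().isEmpty()) {
                continue; // skip leading blank lines
            }
            return HEADER_FIELD.matcher(line).matches();
        }
        return false; // empty input
    }

    /** Counts lines of the form "Content-Disposition: attachment ...". */
    static int countAttachmentParts(String raw) {
        final String name = "content-disposition:";
        int count = 0;
        for (String line : raw.split("\\r?\\n")) {
            String lower = line.trim().toLowerCase(Locale.ROOT);
            if (lower.startsWith(name)
                    && lower.substring(name.length()).trim().startsWith("attachment")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        for (String file : args) {
            // ISO-8859-1 maps every byte to a char, so decoding never fails.
            String raw = new String(Files.readAllBytes(Paths.get(file)),
                    StandardCharsets.ISO_8859_1);
            System.out.printf("%s: headerFirst=%b, attachmentParts=%d%n",
                    file, startsWithHeaderField(raw), countAttachmentParts(raw));
        }
    }
}
```

Running it as `java EmlTriage *.eml` prints one line per file. Files where `headerFirst` is false, or where `attachmentParts` is higher than the number of files Tika extracted, are probably the most useful ones to share.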