I have created a ticket here:
https://issues.apache.org/jira/browse/TIKA-2037
Regards,
Eli Trucco
On 20.07.2016 16:56, Allison, Timothy B. wrote:
Are you able to share some examples? If so, please open a ticket.
-----Original Message-----
From: Eli Trucco [mailto:theknight...@yahoo.com]
Sent: Wednesday, July 20, 2016 10:39 AM
To: user@tika.apache.org
Subject: Problems with email attachments
Hi guys,
I'm currently writing a small app that reads a directory and parses all
documents inside it, including extracting their attachments/embedded files
(if any exist). I use Tika for this, but I ran into a couple of problems
while parsing .eml files from Thunderbird.
Some of them are wrongly identified (as text/html or
application/xhtml+xml), and in many of them the attachments are not detected.
I tried to parse 20 random .eml files with attachments (pdf, txt, html, etc.),
and at least 10 of them are either identified as HTML, or correctly identified
as message/rfc822 but without the attachments being extracted. I got the same
result with the TikaCLI -z option :(
What I did: I extended the class ParsingEmbeddedDocumentExtractor to extract
and store the attachments somewhere else (exactly as shown in this example code:
https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java).
Is this the correct approach, or is there another way to do this? Any ideas on
how to improve the type detection or extract the attachments more reliably
would be much appreciated!
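
To narrow down which files are affected, a pre-check like the one below could
flag files that start with an email header block before they are handed to
Tika. This is only a rough sketch and not how Tika detects types: the class
name EmlHeaderSniffer, the list of header names, and the "at least two known
headers" threshold are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class EmlHeaderSniffer {

    // Hypothetical list of header names treated as evidence of an email message.
    private static final Set<String> KNOWN_HEADERS = new HashSet<>(Arrays.asList(
            "from", "to", "subject", "date", "received",
            "return-path", "message-id", "mime-version"));

    /**
     * Returns true if the given text starts with a header block that
     * contains at least two of the known header names (assumed threshold).
     */
    public static boolean looksLikeRfc822(String head) {
        int hits = 0;
        for (String line : head.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            if (line.startsWith(" ") || line.startsWith("\t")) {
                continue; // folded continuation of the previous header
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                return false; // line is not shaped like "Name: value"
            }
            String name = line.substring(0, colon).toLowerCase(Locale.ROOT);
            if (KNOWN_HEADERS.contains(name)) {
                hits++;
            }
        }
        return hits >= 2;
    }
}
```

A file that passes this check but is still reported as text/html would be a
good candidate to attach to a ticket, and the caller could choose to pass such
files to an email parser directly instead of relying on auto-detection.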
Regards,
Eli Trucco