I have created a ticket here: https://issues.apache.org/jira/browse/TIKA-2037

Regards,
Eli Trucco

On 20.07.2016 16:56, Allison, Timothy B. wrote:
Are you able to share some examples?  If so, please open a ticket.

-----Original Message-----
From: Eli Trucco [mailto:theknight...@yahoo.com]
Sent: Wednesday, July 20, 2016 10:39 AM
To: user@tika.apache.org
Subject: Problems with email attachments

Hi guys,

I'm currently writing a small app that reads a directory and parses all
documents inside it, including extracting their attachments/embedded files
(if any). I use Tika for this, but I ran into a couple of problems while
parsing .eml files from Thunderbird. Some of them are wrongly identified (as
text/html or application/xhtml+xml), and in many of them the attachments are
not detected. I tried to parse 20 random .eml files with attachments (pdf, txt,
html, etc.), and at least 10 of them are either identified as HTML, or
correctly identified as message/rfc822 but without the attachments being
extracted. I got the same result for the same files using the TikaCLI -z
option :(

What I did: I extended the class ParsingEmbeddedDocumentExtractor to extract
and store the attachments somewhere else (exactly as shown in this example code:
https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java).
Is this the correct approach, or is there another way to do this? Any ideas on
how to improve the type detection or extract the attachments better would be
really appreciated!
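
For illustration, the "store the attachments somewhere else" step of such an extractor could look roughly like the sketch below. It uses only the JDK and leaves Tika out. The class AttachmentStore and its store method are hypothetical names, not part of Tika. The assumption is that an overridden parseEmbedded in the ParsingEmbeddedDocumentExtractor subclass would call store with the embedded stream and whatever file name the attachment's metadata reports (possibly none).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not a Tika class): writes embedded-document streams to a directory. */
final class AttachmentStore {
    private final Path outputDir;
    private int fileCount = 0;

    AttachmentStore(Path outputDir) {
        this.outputDir = outputDir;
    }

    /**
     * Copies the stream into outputDir and returns the path that was written.
     * rawName is treated as untrusted: it may be null, blank, or contain path separators.
     */
    Path store(InputStream stream, String rawName) throws IOException {
        fileCount++;
        String name = (rawName == null || rawName.trim().isEmpty())
                ? "file" + fileCount
                : rawName.trim();

        // Keep only the last path segment so a name like "../../x" cannot escape outputDir.
        int slash = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
        if (slash >= 0) {
            name = name.substring(slash + 1);
        }
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            name = "file" + fileCount;
        }

        // Prefix with the counter so two attachments with the same name get distinct files.
        Path target = outputDir.resolve(fileCount + "-" + name);
        Files.createDirectories(outputDir);
        // REPLACE_EXISTING overwrites a file left behind by an earlier run with the same name.
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("attachments");
        AttachmentStore store = new AttachmentStore(dir);
        InputStream data = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(store.store(data, "../report.pdf"));
    }
}
```

Keeping the storage step separate from the parsing hook makes it possible to test it without any .eml files. It does not address the detection problem described above: if a message is identified as HTML, the embedded-document extractor is presumably never invoked for its attachments.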

Regards,

Eli Trucco

