Hi,

I just used Tika (org.apache.tika:tika-parsers:1.13) to parse an e-mail with multipart/mixed content.
Please have a look at the attachment (renamed to txt).

Tika returns the whole file as plain text, including all headers and boundary lines.
The words in the attachment are not extracted either.
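
For reference, the body of the words.txt part is base64, and decoding it with the JDK gives the two lines of words I expected to see in the output. A minimal sketch, using only java.util.Base64 (the class and method names are made up for illustration and have nothing to do with Tika):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AttachmentDecodeSketch {

    // Base64 body of the words.txt part, copied from the attached message.
    static final String ENCODED = "ZG9nIGNhdA0KaG91c2Ugcm9vZg==";

    // Decodes a base64 MIME body and interprets the bytes as UTF-8,
    // matching the charset declared in the part's Content-Type header.
    static String decode(String base64) {
        byte[] raw = Base64.getMimeDecoder().decode(base64);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Prints "dog cat" and "house roof" on two lines (CRLF separated).
        System.out.println(decode(ENCODED));
    }
}
```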

Is this the intended behaviour of Tika?

I expected only the body content of each part. I used the following simple test code to investigate the problem:


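import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.metadata.TikaMetadataKeys;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

final TikaInputStream tikaInputStream = TikaInputStream.get(emlFile);

// Hint the media type and the file name to the detector
final Metadata metadata = new Metadata();
metadata.set(TikaCoreProperties.CONTENT_TYPE_HINT, "message/rfc822");
metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, "mixedtest.eml");

// Collect up to 20,000,000 characters of extracted body text
final BodyContentHandler handler = new BodyContentHandler(20000000);
final AutoDetectParser parser = new AutoDetectParser();
parser.parse(tikaInputStream, handler, metadata);

// Print the extracted text, then the detected media type
System.out.println(handler.toString());
System.out.println("type: " + parser.getDetector().detect(tikaInputStream, metadata));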


The detected type is "message/rfc822", which looks correct to me.

Do I have to split the EML into its content parts myself to get only the body text of each part?
I expected Tika to resolve this kind of multi-part file.
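
To make clear what I mean by splitting the EML myself, here is a rough JDK-only sketch that cuts a multipart body at its boundary lines and returns the body text of each part. It is not a real MIME parser: it does no header unfolding, no recursion into nested multiparts, no transfer decoding, and it ignores padding after a delimiter. The class and method names are hypothetical, and I would much rather have Tika do this for me.

```java
import java.util.ArrayList;
import java.util.List;

public class MultipartSplitSketch {

    // Returns the raw parts (headers, blank line, body) found between
    // the "--boundary" delimiter lines. Preamble and epilogue are dropped.
    static List<String> splitParts(String body, String boundary) {
        String delimiter = "--" + boundary;
        String closing = delimiter + "--";
        List<String> parts = new ArrayList<>();
        StringBuilder current = null; // stays null until the first delimiter
        for (String line : body.split("\r?\n", -1)) {
            if (line.equals(closing)) {
                break; // everything after the closing delimiter is epilogue
            }
            if (line.equals(delimiter)) {
                if (current != null) {
                    parts.add(finish(current));
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            parts.add(finish(current));
        }
        return parts;
    }

    // The line break before a delimiter belongs to the delimiter, so drop it.
    private static String finish(StringBuilder sb) {
        if (sb.length() > 0) {
            sb.setLength(sb.length() - 1);
        }
        return sb.toString();
    }

    // Returns the text after the first blank line, i.e. the part's body.
    static String bodyOf(String part) {
        if (part.startsWith("\n")) {
            return part.substring(1); // part has no headers at all
        }
        int blank = part.indexOf("\n\n");
        return blank < 0 ? "" : part.substring(blank + 2);
    }

    public static void main(String[] args) {
        // Tiny made-up message body, only to exercise the sketch.
        String sample = "preamble\n--B\nContent-Type: text/plain\n\nhello\n"
                + "--B\nContent-Type: text/plain\n\nworld\n--B--\nepilogue\n";
        for (String part : splitParts(sample, "B")) {
            System.out.println("[" + bodyOf(part) + "]");
        }
    }
}
```

For the attached message, the outer boundary would give two parts: the multipart/alternative block (which would have to be split again with its own boundary) and the base64-encoded words.txt.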

Best regards

Ingo


To: Ingo Siebert <[email protected]>
From: Ingo Siebert <[email protected]>
Subject: tika mixed test
Message-ID: <[email protected]>
Date: Wed, 5 Oct 2016 17:49:42 +0200
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.4.0
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="------------8160BFA2238522DB60B073B1"

This is a multi-part message in MIME format.
--------------8160BFA2238522DB60B073B1
Content-Type: multipart/alternative;
 boundary="------------DA6DCDF25D007EBA77D329FA"


--------------DA6DCDF25D007EBA77D329FA
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

Hi *Test*,

that is *bold*.



--------------DA6DCDF25D007EBA77D329FA
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 7bit

<html>
  <head>

    <meta http-equiv="content-type" content="text/html; charset=utf-8">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <p>Hi <b>Test</b>,</p>
    <p>that is <b>bold</b>.</p>
    <p><br>
    </p>
  </body>
</html>

--------------DA6DCDF25D007EBA77D329FA--

--------------8160BFA2238522DB60B073B1
Content-Type: text/plain; charset=UTF-8;
 name="words.txt"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="words.txt"

ZG9nIGNhdA0KaG91c2Ugcm9vZg==
--------------8160BFA2238522DB60B073B1--
