Hello,

I saved two .eml files from Thunderbird: one contains plain-text content, and the other contains rich HTML content.

Tika recognized the plain-text one as "message/rfc822", but it incorrectly detected the other one as "text/html" (and its textual content was extracted incorrectly).
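
As a sanity check that is independent of Tika, a sketch like the following can confirm that a file shaped like my HTML .eml parses as a MIME message whose top-level type is multipart/related, with text/html being only the type of its first part. It uses Python's standard email package. The inline SAMPLE is an abbreviated stand-in for the real file, the example.com addresses and the short base64 payload are placeholders, and describe() is just a hypothetical helper name. It says nothing about how Tika itself performs detection.

```python
from email import message_from_string

# Boundary value copied from the Content-Type header of the HTML .eml.
BOUNDARY = "------------010102060501000809020808"

# Abbreviated stand-in for the real file: addresses and the image payload
# are placeholders, and most X-Mozilla-* headers are omitted.
SAMPLE = f"""\
X-Mozilla-Status: 0001
X-Mozilla-Status2: 01000000
From: Sender <sender@example.com>
Subject: My rich mail with signature
To: recipient@example.com
Date: Fri, 13 Nov 2015 07:07:42 +0100
MIME-Version: 1.0
Content-Type: multipart/related;
 boundary="{BOUNDARY}"

This is a multi-part message in MIME format.
--{BOUNDARY}
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 7bit

<html><body>This is <b>rich formatted email text</b>.</body></html>

--{BOUNDARY}
Content-Type: image/jpeg
Content-Transfer-Encoding: base64
Content-ID: <part1@example.com>

/9j/4AAQSkZJRg==

--{BOUNDARY}--
"""


def describe(raw):
    """Return the top-level content type and the types of all leaf parts."""
    msg = message_from_string(raw)
    leaf_types = [p.get_content_type() for p in msg.walk() if not p.is_multipart()]
    return msg.get_content_type(), leaf_types


top_level, part_types = describe(SAMPLE)
print(top_level, part_types)
```

For this sample the top-level type comes out as multipart/related, with one text/html part and one image/jpeg part, so the message headers themselves look fine to a MIME parser.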

Any suggestions on how to overcome this?

Here is my HTML .eml file from Thunderbird:

X-Mozilla-Status: 0001
X-Mozilla-Status2: 01000000
X-Mozilla-Keys:
FCC: mailbox://[email protected]/Sent
X-Identity-Key: id1
X-Account-Key: account1
From: Vjeran Marcinko <[email protected]>
Subject: My rich mail with signature
To: [email protected]
Message-ID: <[email protected]>
Date: Fri, 13 Nov 2015 07:07:42 +0100
X-Mozilla-Draft-Info: internal/draft; vcard=0; receipt=0; DSN=0; uuencode=0;
 attachmentreminder=0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
 Thunderbird/38.3.0
MIME-Version: 1.0
Content-Type: multipart/related;
 boundary="------------010102060501000809020808"

This is a multi-part message in MIME format.
--------------010102060501000809020808
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 7bit

<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <title>My rich mail with signature</title>
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    This is the beginning of <b>rich formatted email text</b>. Here is
    my signature: <img alt="here should be signature picture"
      src="cid:[email protected]" height="104"
      width="182" align="middle"><br>
    After that the <font color="#ff0000">RED COLOR </font> is shown.<br>
    <br>
  </body>
</html>

--------------010102060501000809020808
Content-Type: image/jpeg
Content-Transfer-Encoding: base64
Content-ID: <[email protected]>

/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYH
BwcGBwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcI
DAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAz/wAAR
CABoALYDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAA
AgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkK
FhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWG
....
