Hello,
I have two .eml files saved by Thunderbird: one contains plain-text
content, the other rich HTML content.
Tika recognized the plain-text one as a "message/rfc822" file, but it
incorrectly detected the other as "text/html" (and extracted its textual
content incorrectly as a result).
Any suggestions on how to overcome this?
Here is my HTML .eml file from Thunderbird:
X-Mozilla-Status: 0001
X-Mozilla-Status2: 01000000
X-Mozilla-Keys:
FCC: mailbox://[email protected]/Sent
X-Identity-Key: id1
X-Account-Key: account1
From: Vjeran Marcinko <[email protected]>
Subject: My rich mail with signature
To: [email protected]
Message-ID: <[email protected]>
Date: Fri, 13 Nov 2015 07:07:42 +0100
X-Mozilla-Draft-Info: internal/draft; vcard=0; receipt=0; DSN=0;
 uuencode=0;
 attachmentreminder=0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
 Thunderbird/38.3.0
MIME-Version: 1.0
Content-Type: multipart/related;
 boundary="------------010102060501000809020808"

This is a multi-part message in MIME format.
--------------010102060501000809020808
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 7bit

<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>My rich mail with signature</title>
</head>
<body text="#000000" bgcolor="#FFFFFF">
This is the beginning of <b>rich formatted email text</b>. Here is
my signature: <img alt="here should be signature picture"
src="cid:[email protected]" height="104"
width="182" align="middle"><br>
After that the <font color="#ff0000">RED COLOR </font> is shown.<br>
<br>
</body>
</html>
--------------010102060501000809020808
Content-Type: image/jpeg
Content-Transfer-Encoding: base64
Content-ID: <[email protected]>

/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYH
BwcGBwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcI
DAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAz/wAAR
CABoALYDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAA
AgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkK
FhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWG
....
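
One possible workaround, sketched here as an assumption rather than anything Tika provides: check the start of the file for mail headers before relying on automatic detection. The class name EmlSniffer, the method looksLikeRfc822, the list of known headers and the "at least two" threshold are all hypothetical choices made for illustration. The check uses only the Java standard library and has not been run against Tika.

```java
// Hypothetical helper, not part of Tika: a stdlib-only pre-check that a text
// starts with a mail-style header block.
final class EmlSniffer {

    // Header names that commonly appear in mail messages (illustrative, not exhaustive).
    private static final String[] KNOWN_HEADERS = {
        "From", "To", "Subject", "Date", "Message-ID", "MIME-Version", "Received"
    };

    private EmlSniffer() {}

    /**
     * Returns true if the text begins with a header block that contains at
     * least two well-known mail header lines. The threshold is an assumption.
     */
    static boolean looksLikeRfc822(String text) {
        int known = 0;
        boolean seenHeader = false;
        for (String line : text.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                if (!seenHeader) {
                    return false; // continuation line with nothing to continue
                }
                continue; // folded continuation of the previous header
            }
            int colon = line.indexOf(':');
            if (colon <= 0 || !isFieldName(line, colon)) {
                break; // not a header line: stop scanning
            }
            seenHeader = true;
            String name = line.substring(0, colon);
            for (String candidate : KNOWN_HEADERS) {
                if (candidate.equalsIgnoreCase(name)) {
                    known++;
                    break;
                }
            }
        }
        return known >= 2;
    }

    // Field names are printable US-ASCII characters without spaces;
    // the range [0, end) stops before the colon.
    private static boolean isFieldName(String line, int end) {
        for (int i = 0; i < end; i++) {
            char c = line.charAt(i);
            if (c < 33 || c > 126) {
                return false;
            }
        }
        return true;
    }
}
```

If the check passes, the stream could presumably be handed straight to org.apache.tika.parser.mail.RFC822Parser instead of going through AutoDetectParser, so that the HTML part inside the message no longer decides the detected type. Whether that fully fixes the extraction for this file is untested.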