I have to check whether the sample I have can be released publicly. So far, the only thing I've seen open it properly is Microsoft Outlook, which shows "Saved by Windows Internet Explorer 8" in the From field.

When I ask Tika 1.3 to detect the format of this file it says:
java -jar ./tika-app-1.3.jar -d Receipt.mht
application/vnd.ms-outlook

If you open it in a text editor it's illegible, so it must be compressed or binary-encoded somehow. Here is the metadata Tika reports for it:
java -jar ./tika-app-1.3.jar -m Receipt.mht
Author: Saved by Windows Internet Explorer 8
Content-Length: 43008
Content-Type: application/vnd.ms-outlook
Creation-Date: 2012-10-05T14:44:15Z
Last-Modified: 2012-10-05T14:44:15Z
Last-Save-Date: 2012-10-05T14:44:15Z
Message-Bcc:
Message-Cc:
Message-From: Saved by Windows Internet Explorer 8
Message-To:
creator: Saved by Windows Internet Explorer 8
date: 2012-10-05T14:44:15Z
dc:creator: Saved by Windows Internet Explorer 8
dc:description: Receipt
dc:title: Receipt
dcterms:created: 2012-10-05T14:44:15Z
dcterms:modified: 2012-10-05T14:44:15Z
meta:author: Saved by Windows Internet Explorer 8
meta:creation-date: 2012-10-05T14:44:15Z
meta:save-date: 2012-10-05T14:44:15Z
modified: 2012-10-05T14:44:15Z
resourceName: Receipt.mht
subject: Receipt
title: Receipt
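
One quick way to narrow this down without Tika is to look at the first few bytes of the file. Outlook .msg files are OLE2 compound documents, which always begin with the signature D0 CF 11 E0 A1 B1 1A E1, whereas an rfc822-style MHTML file begins with readable header text. The sketch below is only that, a sketch: ContainerSniffer and classify are hypothetical names of mine (not Tika API), and I'm assuming, without having confirmed it, that this Receipt.mht is an OLE2 container.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: a rough first-bytes check only.
final class ContainerSniffer {
    // Signature that opens every OLE2 compound document (the container .msg uses).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns "ole2", "text", "empty" or "unknown-binary" for the leading bytes of a file.
    static String classify(byte[] head) {
        if (head.length == 0) {
            return "empty";
        }
        if (head.length >= OLE2_MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, OLE2_MAGIC.length), OLE2_MAGIC)) {
            return "ole2";
        }
        // Treat the sample as text only if every byte is printable ASCII or tab/CR/LF.
        for (byte b : head) {
            int c = b & 0xFF;
            boolean printable = c >= 0x20 && c < 0x7F;
            boolean whitespace = c == '\t' || c == '\r' || c == '\n';
            if (!printable && !whitespace) {
                return "unknown-binary";
            }
        }
        return "text";
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[64];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int n = in.read(buf); // a single read is enough for a signature check
            System.out.println(classify(Arrays.copyOf(buf, Math.max(n, 0))));
        }
    }
}
```

Run it as "java ContainerSniffer Receipt.mht". A result of "ole2" would be consistent with Tika reporting application/vnd.ms-outlook, since the container signature alone cannot tell an Outlook message apart from other OLE2-based files; "text" would point to the usual rfc822-style MHTML.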

On 2/6/2013 11:49 AM, Nick Burch wrote:
On Wed, 6 Feb 2013, AJ Weber wrote:
Anyone know if proper detection of MHT/MHTML files is on the roadmap for Tika?

Tika can already detect MHTML files, and parse them. We have unit tests for it.

However, there might be more than one format using that extension...

I see that the format is a "close relative" of an outlook MSG file (it's got a mime-encapsulated format), and that's what Tika appears to think they are -- but they're not.

None of the .mhtml files in the Tika test suite are anything like an Outlook MSG file - they're all mbox / rfc822 style ones. (*.mht and *.mhtml are both glob aliases of message/rfc822)

Do you have a small sample file of your other kind of file? And do you know what software generated it, and what that software calls the file format?

Nick
