On Wed, 6 Feb 2013, AJ Weber wrote:
Anyone know if proper detection of MHT/MHTML files is on the roadmap for Tika?

Tika can already detect MHTML files, and parse them. We have unit tests for it.

However, there might be more than one format using that extension...

I see that the format is a "close relative" of an Outlook MSG file (it's got a MIME-encapsulated format), and that's what Tika appears to think they are -- but they're not.

None of the .mhtml files in the Tika test suite are anything like an Outlook MSG file - they're all mbox / rfc822 style ones. (*.mht and *.mhtml are both glob aliases of message/rfc822)
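
To illustrate the distinction, here is a minimal sketch, not Tika's actual detection code, that tells the two families apart by looking at the first bytes of a file. It assumes an Outlook MSG file starts with the OLE2 compound document signature, and that an rfc822-style MHTML file starts with a plain-text mail header. The class name MhtSniffer, the header list, and the returned labels are all made up for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Locale;

public class MhtSniffer {
    // OLE2 compound document signature, the container used by Outlook .msg files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Assumption: a short, illustrative list of headers that commonly open
    // an rfc822-style message. It is not exhaustive.
    private static final String[] HEADER_PREFIXES = {
        "from:", "mime-version:", "content-type:", "subject:", "date:", "received:"
    };

    /** Classifies the leading bytes of a file as "ole2", "rfc822-style" or "unknown". */
    public static String sniff(byte[] prefix) {
        if (prefix.length >= OLE2_MAGIC.length
                && Arrays.equals(Arrays.copyOf(prefix, OLE2_MAGIC.length), OLE2_MAGIC)) {
            return "ole2";
        }
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail.
        String text = new String(prefix, StandardCharsets.ISO_8859_1).toLowerCase(Locale.ROOT);
        for (String header : HEADER_PREFIXES) {
            if (text.startsWith(header)) {
                return "rfc822-style";
            }
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MhtSniffer <file>");
            return;
        }
        byte[] all = Files.readAllBytes(Paths.get(args[0]));
        // Only the first few bytes are needed for this check.
        byte[] prefix = Arrays.copyOf(all, Math.min(all.length, 64));
        System.out.println(args[0] + ": " + sniff(prefix));
    }
}
```

Running this against a problem .mht file would show whether it begins like a mail message, like an OLE2 container, or like neither. An "unknown" result would point to a third format sharing the extension.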

Do you have a small sample file of your other kind of file? And do you know what software generated it, and what that software calls the file format?

Nick
