On Tue, 22 Dec 2020, Peter Kronenberg wrote:
> I'm trying to detect the mimetype of a file using both
>
> Tika.detect(InputStream)
> and
> Tika.detect(File)
>
> I get 2 different results.  I'm testing with a Microsoft Word (.doc) file.

The InputStream one is based on just the first few KB of the file. That's enough to figure out that it's an OLE2 file, but not which flavour.

The File one reads the whole file, checks the OLE2 directory entries, and identifies that you have a Word file.


If you gave Tika the InputStream plus the filename on a Metadata object, it would specialise the generic OLE2 type to Word based on the extension.
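
A minimal sketch of the stream-plus-filename approach, in case it helps. It assumes the Tika facade's detect(InputStream, String) overload, which as far as I know puts the name on a Metadata object for you. The class and method names (DetectWithName, detectWithName) are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.Tika;

// Hypothetical helper class, not part of Tika.
public class DetectWithName {

    // Detects the type from the start of the stream, using the file name as an extra hint.
    static String detectWithName(InputStream in, String fileName) throws IOException {
        Tika tika = new Tika();
        // Assumption: this overload records the name in the metadata used for detection,
        // so the extension can narrow a generic container type.
        return tika.detect(in, fileName);
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args[0]);
        try (InputStream in = Files.newInputStream(path)) {
            System.out.println(detectWithName(in, path.getFileName().toString()));
        }
    }
}
```

The trade-off is that the extension is only a hint, so a misnamed file can be mislabelled.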

If you gave Tika a TikaInputStream, it would detect that a File was needed for a fully precise answer, spool the stream to a temporary File, then use that to detect (and later to parse, if you need that).
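
A rough sketch of the TikaInputStream route, assuming the container-aware detector (shipped with the Tika parsers, as far as I recall) is on the classpath. The class name DetectWithTikaInputStream and its detect method are illustrative only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;

// Hypothetical helper class, not part of Tika.
public class DetectWithTikaInputStream {

    // Wraps a plain stream so that detectors may spool it to a temporary file when needed.
    static String detect(InputStream raw) throws IOException {
        Tika tika = new Tika();
        // Closing the TikaInputStream also closes the wrapped stream and
        // releases any temporary file created along the way.
        try (TikaInputStream tis = TikaInputStream.get(raw)) {
            return tika.detect(tis);
        }
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(detect(in));
        }
    }
}
```

If you already have the file on disk, wrapping the Path directly with TikaInputStream.get(Path) should avoid the spooling step.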

Nick
