On Tue, 22 Dec 2020, Peter Kronenberg wrote:
> I'm trying to detect the mimetype of a file using both
> Tika.detect(InputStream)
> and
> Tika.detect(File)
> but I get 2 different results. I'm testing with a Microsoft Word (.doc) file.
The InputStream one is based on just the first few KB of the file. That's
enough to figure out it's an OLE2 file, but not which flavour.
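To illustrate why the start of the stream can't settle it: every OLE2 (Compound File) document, whether Word, Excel or PowerPoint, opens with the same 8-byte signature. The sketch below is not Tika's code, just a standalone stdlib illustration; the class and method names are made up.

```java
import java.util.Arrays;

public class Ole2Sniffer {
    // The 8-byte signature that opens every OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    /** Returns true if the given leading bytes start with the OLE2 signature. */
    public static boolean looksLikeOle2(byte[] head) {
        return head != null
            && head.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(head, OLE2_MAGIC.length), OLE2_MAGIC);
    }
}
```

A true result only says the container is OLE2. A .doc, .xls and .ppt all pass this check, so telling them apart means looking inside the container.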
The File one reads the whole file, checks the OLE2 directory entries, and
identifies that you have a Word file.
If you gave Tika the InputStream plus the filename on a Metadata object, it
would specialise the OLE2 type to Word based on the extension.
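A minimal sketch of passing the name along with the stream, assuming Tika is on the classpath. It uses the Tika facade's detect(InputStream, String) overload, which sets the name on a Metadata object for you. The class name, the method name and the path argument are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.Tika;

public class DetectWithName {
    /** Detects from a plain stream, passing the file name as a hint. */
    public static String detectWithName(Path file) throws IOException {
        Tika tika = new Tika();
        try (InputStream in = Files.newInputStream(file)) {
            // The stream content is still examined; the name's extension
            // lets Tika narrow a generic container type to a specific one.
            return tika.detect(in, file.getFileName().toString());
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the document to test (hypothetical input)
        System.out.println(detectWithName(Paths.get(args[0])));
    }
}
```

The extension is only a hint supplied by the caller, so a misnamed file can still be specialised to the wrong type.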
If you gave Tika a TikaInputStream, it would detect that a File was needed
for a fully precise answer, spool the Stream to a File, then use that to
detect (and later to parse, if you need).
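A minimal sketch of the TikaInputStream route. As far as I know the OLE2 container detector ships with the parsers module rather than tika-core, so this assumes both are on the classpath. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;

public class DetectWithTikaInputStream {
    /** Wraps a plain stream so detectors can request a File if they need one. */
    public static String detectSpooled(InputStream raw) throws IOException {
        Tika tika = new Tika();
        // TikaInputStream can spool the stream to a temporary file on demand;
        // closing it also cleans up that temporary file.
        try (TikaInputStream tis = TikaInputStream.get(raw)) {
            return tika.detect(tis);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the document to test (hypothetical input)
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(detectSpooled(in));
        }
    }
}
```

If the document is already on disk, TikaInputStream.get(Path) should avoid the spooling step, since the file is available directly.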
Nick