On Wed, 23 Dec 2020, Peter Kronenberg wrote:
Best is to wrap as a TikaInputStream, detect using all the detectors via DefaultDetector, then parse after that.

But sometimes the detect will read the whole file, right? For example, for Word. So is it then making 2 passes?

Nope, we stash the open container ready for re-use by the parser
https://tika.apache.org/1.24.1/api/org/apache/tika/io/TikaInputStream.html#getOpenContainer--
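
For illustration, the detect-then-parse flow described above might look roughly like the sketch below. It assumes Tika 1.24.x with tika-core and tika-parsers on the classpath. The class name DetectThenParse and the method detectThenParse are made up for this example, and it reads from a byte array only to keep the sketch self-contained. The key point is that the same TikaInputStream instance is handed to both the detector and the parser, so a container opened during detection is available to the parser through getOpenContainer().

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical example class; not part of Tika.
public class DetectThenParse {

    // Returns { detected type, Content-Type metadata value, whether a container was stashed }.
    public static String[] detectThenParse(byte[] data)
            throws IOException, TikaException, SAXException {
        Detector detector = new DefaultDetector();       // loads all available detectors
        Parser parser = new AutoDetectParser(detector);
        Metadata metadata = new Metadata();

        try (TikaInputStream tis = TikaInputStream.get(data)) {
            // Step 1: detect. Container-aware detectors may open the file here.
            MediaType detected = detector.detect(tis, metadata);

            // If detection opened a container (e.g. for a Word file), it is
            // stashed on the stream; for plain text this is expected to be null.
            boolean stashed = tis.getOpenContainer() != null;

            // Step 2: parse the SAME stream object, so the parser can pick up
            // whatever the detector stashed instead of opening the file again.
            BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
            parser.parse(tis, handler, metadata, new ParseContext());

            // The type ends up in the metadata under "Content-Type".
            return new String[] {
                detected.toString(),
                metadata.get(Metadata.CONTENT_TYPE),
                Boolean.toString(stashed)
            };
        }
    }

    public static void main(String[] args) throws Exception {
        String[] result = detectThenParse("hello world".getBytes(StandardCharsets.UTF_8));
        System.out.println("Detected:         " + result[0]);
        System.out.println("Content-Type:     " + result[1]);
        System.out.println("Container stashed: " + result[2]);
    }
}
```

For a real file, TikaInputStream.get(Path) can be used in place of the byte-array variant; the rest of the flow stays the same.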

IIRC it does if you use AutoDetectParser but not always otherwise

Oh, ok, you’re right. It’s listed as Content-Type. I was searching for Mime-type. 😊

Yes, that's the standard http header for it, and we try to re-use existing definitions where possible!

Nick
