On Wed, 23 Dec 2020, Peter Kronenberg wrote:
>> Best is to wrap as a TikaInputStream, detect using all the detectors
>> via DefaultDetector, then parse after that.
>
> But sometimes the detect will read the whole file, right? For example,
> for Word. So is it then making 2 passes?

Nope, we stash the open container ready for re-use by the parser
https://tika.apache.org/1.24.1/api/org/apache/tika/io/TikaInputStream.html#getOpenContainer--
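
A rough sketch of that detect-then-parse flow, in case it helps. It assumes
Tika 1.24.x with the standard parsers on the classpath; the class and method
names (DetectThenParse, detectThenParse) are made up for illustration. The
point is that the same TikaInputStream goes to both the detector and the
parser, so anything the detector opened can be picked up again.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class, not part of Tika.
class DetectThenParse {

    // Returns { detected media type, Content-Type found in the metadata after parsing }.
    static String[] detectThenParse(Path file) throws Exception {
        Detector detector = new DefaultDetector();          // all detectors found on the classpath
        AutoDetectParser parser = new AutoDetectParser(detector);
        Metadata metadata = new Metadata();

        // One TikaInputStream for both steps; closing it also cleans up temp resources.
        try (TikaInputStream tis = TikaInputStream.get(file)) {
            // Step 1: detect. The stream is left positioned at the start again.
            MediaType detected = detector.detect(tis, metadata);

            // Step 2: parse the very same stream (-1 disables the handler's write limit).
            BodyContentHandler handler = new BodyContentHandler(-1);
            parser.parse(tis, handler, metadata, new ParseContext());

            return new String[] { detected.toString(), metadata.get(Metadata.CONTENT_TYPE) };
        }
    }

    public static void main(String[] args) throws Exception {
        String[] result = detectThenParse(Paths.get(args[0]));
        System.out.println("Detected:     " + result[0]);
        System.out.println("Content-Type: " + result[1]);
    }
}
```

Whether a container actually gets stashed depends on the file type and on
which detectors are available, so treat the above as the general shape
rather than a guarantee for every format.
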
>> IIRC it does if you use AutoDetectParser but not always otherwise
>
> Oh, ok, you’re right. It’s listed as Content-Type. I was searching for
> Mime-type. 😊

Yes, that's the standard HTTP header for it, and we try to re-use existing
definitions where possible!
Nick