On Wed, 23 Dec 2020, Peter Kronenberg wrote:
But yet, if I understand correctly, using a TikaInputStream *will* spool the entire stream to disk so it can read everything, right? If I re-read the stream to parse, is it making 2 passes?

TikaInputStream has logic in it to dump the stream to a temp file so that it can be re-read multiple times as required. It only does that dump if required, though; for formats that don't need it, it just acts as a buffering / mark + reset stream.
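To illustrate the mark + reset behaviour mentioned above, here is a minimal sketch that uses only the JDK, not Tika itself. The class name MarkResetPeek and the helper peek are hypothetical, and the sample bytes are made up. It reads the first few bytes of a buffered stream, rewinds, and then reads the whole stream again from the start, so nothing is spooled to disk. It assumes Java 11 or later for readNBytes.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class MarkResetPeek {

    /** Reads up to n bytes from the head of the stream, then rewinds it. */
    public static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n);                    // remember the current position
        try {
            return in.readNBytes(n);   // Java 11+: reads up to n bytes
        } finally {
            in.reset();                // rewind so the caller sees the full stream
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data standing in for an anonymous incoming stream.
        byte[] data = "%PDF-1.7 rest of the document".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));

        byte[] head = peek(in, 5);
        System.out.println("head: " + new String(head, StandardCharsets.US_ASCII));

        // After the peek, the stream is back at position 0.
        byte[] all = in.readAllBytes();
        System.out.println("full length preserved: " + (all.length == data.length));
    }
}
```

This only covers the cheap case. Formats that need random access to the whole file are where a temp-file spool comes in, and this sketch does not model that.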

In my use case, we will not have any filename or metadata. It will just be a stream. But you're right in that we will want to parse it. So it sounds like the best way to do it is to do the detect on the first few bytes, which will at least give you an idea of what it is, but not precise. (Should this be a TikaStream?) And then do the parse.

Best is to wrap as a TikaInputStream, detect using all the detectors via DefaultDetector, then parse after that.

I'm still surprised, however, that the mimetype doesn't seem to appear on the Metadata after parsing.

IIRC it does if you use AutoDetectParser, but not always otherwise. I'm not certain on that, though...
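A rough sketch of the wrap, detect, then parse flow, for the case where all you have is a bare stream. It assumes tika-core and the standard parsers are on the classpath. The class name DetectThenParse, the method detectAndParse and the sample text are hypothetical. It returns the detected type, the Content-Type found on the Metadata after parsing, and the extracted text, so you can check for yourself whether the mimetype shows up.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class DetectThenParse {

    /** Returns { detected type, Content-Type from Metadata after parsing, extracted text }. */
    public static String[] detectAndParse(InputStream raw) throws Exception {
        Detector detector = new DefaultDetector();          // all available detectors
        AutoDetectParser parser = new AutoDetectParser(detector);
        Metadata metadata = new Metadata();                 // no filename or hints available

        // Wrap once, and reuse the same TikaInputStream for detect and parse.
        try (TikaInputStream tis = TikaInputStream.get(raw)) {
            MediaType detected = detector.detect(tis, metadata);

            BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
            parser.parse(tis, handler, metadata, new ParseContext());

            return new String[] {
                detected.toString(),
                metadata.get(Metadata.CONTENT_TYPE),
                handler.toString()
            };
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input standing in for an anonymous stream.
        byte[] data = "Hello from a plain text stream.\n".getBytes(StandardCharsets.UTF_8);
        String[] result = detectAndParse(new ByteArrayInputStream(data));
        System.out.println("detected:     " + result[0]);
        System.out.println("Content-Type: " + result[1]);
        System.out.println("text:         " + result[2].trim());
    }
}
```

AutoDetectParser runs detection itself as part of parse, so the explicit detect call is only needed if you want the type before parsing. Closing the TikaInputStream also cleans up any temp file it created.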

Nick
