RE: Mimetypes

Nick Burch Wed, 23 Dec 2020 06:27:54 -0800

On Wed, 23 Dec 2020, Peter Kronenberg wrote:

But yet, if I understand correctly, using a TikaInputStream *will* spool the entire stream to disk so it can read everything, right? If I re-read the stream to parse, is it making 2 passes?

TikaInputStream has logic in it to dump the stream to a temp file so it can be re-read multiple times as required. It only does that dump if required though; for formats that don't need it, it just acts as a buffering / mark + reset stream.
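
To illustrate just the mark + reset part, here is a minimal sketch using only the JDK. It is not Tika's implementation: the class name MarkResetSketch and the helper peek are made up for this example, and a plain BufferedInputStream stands in for the wrapped stream. It reads a few leading bytes, rewinds, and then reads the whole stream again from the start without needing a second copy of the data.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class MarkResetSketch {

    // Hypothetical helper (not a Tika API): read up to `limit` leading bytes,
    // then rewind so the next reader sees the stream from the beginning.
    public static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);                  // remember the current position
        try {
            return in.readNBytes(limit); // may return fewer bytes at end of stream
        } finally {
            in.reset();                  // rewind to the marked position
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data standing in for an incoming document stream.
        byte[] data = "%PDF-1.7 example body".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            byte[] head = peek(in, 5);        // "detect" step: look at the first bytes
            byte[] all = in.readAllBytes();   // "parse" step: full read after the rewind
            System.out.println(new String(head, StandardCharsets.US_ASCII)
                    + " / " + all.length + " bytes");
        }
    }
}
```

The rewind only works while the bytes read since mark() stay within the read limit, which is why a detector that needs to see the whole file has to fall back on something like a temp file instead.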

In my use case, we will not have any filename or metadata. It will just be a stream. But you're right in that we will want to parse it. So it sounds like the best way to do it is to do the detect on the first few bytes, which will at least give you an idea of what it is, though not a precise one. (Should this be a TikaStream?) And then do the parse.

Best is to wrap as a TikaInputStream, detect using all the detectors via DefaultDetector, then parse after that.

I'm still surprised, however, that the mimetype doesn't seem to appear on the Metadata after parsing.

IIRC it does if you use AutoDetectParser but not always otherwise, but I'm not certain on that...


Nick
