On Fri, 26 Feb 2021, Peter Kronenberg wrote:
For most audio files, using the AudioParser, the buffer is still at the beginning. Even though there is no text extraction, I would think that Tika still needs to read through the stream. The MP3Parser consumes the stream, but the MP4Parser does not.

IIRC the MP4 parsing library we use needs a File not a Stream, so we have to spool everything to disk

The OCR parser also leaves the pointer at the beginning. It definitely consumes the stream, so it must be resetting it.

OCR needs a file to call out to Tesseract with, so has to spool the stream to disk

So what is going on? And now I get back to my original question: what is the best way to consistently re-use the stream?

Forcing Tika to spool to disk is probably the only way to be sure, assuming you don't have enough memory to always buffer everything in RAM.
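
As a rough illustration of the spool-to-disk idea, here is a minimal sketch using only the JDK, not Tika's own code. The class name SpoolDemo and the helpers spool() and countBytes() are made up for this example; countBytes() just stands in for any consumer that reads the stream to the end. The point is that once the bytes are in a temp file, every consumer can open its own fresh stream, so it no longer matters whether a given parser leaves the pointer at the start or at the end.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpoolDemo {

    // Hypothetical helper: copy the whole stream to a temporary file once.
    public static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spooled-", ".bin");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    // Stand-in for a consumer (parser, checksum, ...) that reads to the end.
    public static int countBytes(InputStream in) throws IOException {
        int total = 0;
        byte[] buf = new byte[8192];
        for (int read; (read = in.read(buf)) != -1; ) {
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        InputStream original = new ByteArrayInputStream(
                "some audio bytes".getBytes(StandardCharsets.UTF_8));
        Path spooled = spool(original);
        try {
            // Each pass opens a new stream on the file, so both see all bytes.
            try (InputStream first = Files.newInputStream(spooled)) {
                System.out.println("first pass:  " + countBytes(first));
            }
            try (InputStream second = Files.newInputStream(spooled)) {
                System.out.println("second pass: " + countBytes(second));
            }
        } finally {
            Files.deleteIfExists(spooled);
        }
    }
}
```

Within Tika itself, I believe wrapping the input with TikaInputStream.get(...) and asking it for its backing file gives a similar effect, but check that against the Javadoc for the version you are running.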

Nick
