On Fri, 26 Feb 2021, Peter Kronenberg wrote:
For most audio files, using the AudioParser, the buffer is still at the beginning. Even though there is no text extraction, I would think that Tika still needs to read through the stream. The MP3Parser consumes the stream, but the MP4Parser does not.
IIRC the MP4 parsing library we use needs a File, not a Stream, so we have to spool everything to disk.
The OCR parser also leaves the pointer at the beginning. It definitely consumes the stream, so it must be resetting it.
OCR needs a file to call out to Tesseract with, so it has to spool the stream to disk.
So what is going on? And now I get back to my original question, which is: what is the best way to consistently be able to re-use the stream?
Forcing Tika to spool to disk is probably the only way to be sure, assuming you don't have enough memory to always buffer everything in RAM.
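
As a rough illustration of the spool-to-disk idea, here is a minimal sketch using only the JDK. The class name SpooledStream and its methods are hypothetical, made up for this example, and are not part of Tika. Within Tika itself, wrapping the input with TikaInputStream.get(stream) and calling getFile() should give a similar effect, but check that against the version you are running.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Hypothetical helper (not a Tika class): copies a one-shot stream to a
 * temporary file so the same bytes can be read as many times as needed.
 */
final class SpooledStream implements AutoCloseable {
    private final Path tempFile;

    private SpooledStream(Path tempFile) {
        this.tempFile = tempFile;
    }

    /** Reads the source stream to its end and stores the bytes on disk. */
    public static SpooledStream spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spooled-", ".bin");
        try {
            // createTempFile already created the file, so overwrite it
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            Files.deleteIfExists(tmp);
            throw e;
        }
        return new SpooledStream(tmp);
    }

    /** Opens a fresh stream positioned at the first byte of the spooled data. */
    public InputStream open() throws IOException {
        return Files.newInputStream(tempFile);
    }

    /** Deletes the temporary file. */
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(tempFile);
    }
}
```

With something like this, every consumer gets its own stream from open(), so it no longer matters whether a given parser leaves the original stream at the start, in the middle, or at the end.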
Nick
