Yes. Any complex file format that might contain an embedded image: HTML, RTF, email, PST... anything.
When Tesseract was added to Tika, the idea was that if the application is on the user's path, then Tika should call it on everything. PDFs were a different matter, though, because of the potential DoS from running OCR on inline images (e.g. thousands of inline images in a single page). So we wound up with what we have now.

On Mon, Jan 11, 2021 at 10:37 AM Peter Kronenberg <[email protected]> wrote:

> For other than PDF and Image files (e.g., PNG, JPEG), is OCR processing
> always applied (assuming EnableImageProcessing=TRUE)? For example, with DOC,
> DOCX and other Office formats, I get both extracted text and OCRed images.
> Is this true for any files like that? (I can't think of any other examples
> offhand that might have embedded images.)
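
As a rough illustration of the "if Tesseract is on the user's path" idea from the reply, here is a minimal sketch of a PATH lookup using only the Java standard library. This is not Tika's actual detection code; the class name `OcrAvailability` and the method `isOnPath` are hypothetical and exist only for this example.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: checks whether an executable
// can be found in any directory listed in a PATH-style string.
final class OcrAvailability {

    static boolean isOnPath(String executable, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return false;
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            // Try the bare name and the Windows-style ".exe" variant.
            for (String name : new String[] {executable, executable + ".exe"}) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                        return true;
                    }
                } catch (InvalidPathException e) {
                    // Malformed PATH entry: skip it and keep looking.
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean available = isOnPath("tesseract", System.getenv("PATH"));
        System.out.println(available
                ? "tesseract found on PATH"
                : "tesseract not found on PATH");
    }
}
```

A check along these lines only tells you whether the binary is present. Whether OCR is then applied to a given format, and in particular to inline images in PDFs, is a separate configuration decision, as the reply explains.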
