Re: OCR of other than PDF files

Tim Allison Mon, 11 Jan 2021 12:35:55 -0800

Yes.  Any complex file format that might contain an embedded image: HTML,
RTF, email, PST... anything.

When Tesseract was added to Tika, the idea was that if the application is
on the user's path, then Tika should call it on everything.

PDFs were something different, though, because of the potential DoS from
running OCR on inline images (e.g., thousands of inline images in a single
page).  So, we wound up with what we have now.
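
To make the two rules above concrete, here is a rough sketch in Python
(not Tika's actual Java code).  It assumes the OCR executable is named
"tesseract" and that PDF inline images are skipped unless the caller opts
in; the function names and the allow_pdf_inline flag are hypothetical and
exist only for illustration.

```python
import shutil


def ocr_available(executable="tesseract", path=None):
    """Return True if the OCR executable can be found on the search path."""
    # shutil.which returns None when nothing matching is found.
    return shutil.which(executable, path=path) is not None


def should_ocr(container_type, ocr_found, allow_pdf_inline=False):
    """Decide whether embedded images in a container should be sent to OCR.

    container_type is a lowercase label such as "docx", "html" or "pdf".
    allow_pdf_inline is a hypothetical opt-in flag for PDF inline images.
    """
    # No OCR engine on the path: never attempt OCR.
    if not ocr_found:
        return False
    # PDFs are the special case: a single page can hold thousands of
    # inline images, so they are skipped unless explicitly enabled.
    if container_type == "pdf":
        return allow_pdf_inline
    # Every other container format gets OCR on its embedded images.
    return True


if __name__ == "__main__":
    found = ocr_available()
    for kind in ("docx", "html", "pdf"):
        print(kind, should_ocr(kind, found))
```

The point of the split is that the "is Tesseract installed" check stays
global, while the PDF decision is left to whoever knows the documents are
safe to process.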

On Mon, Jan 11, 2021 at 10:37 AM Peter Kronenberg <[email protected]>
wrote:

> For files other than PDFs and images (e.g., PNG, JPEG), is OCR processing
> always applied (assuming EnableImageProcessing=TRUE)?  For example, with DOC,
> DOCX and other Office formats, I get both extracted text and OCRed images.
> Is this true for any files like that?  (I can’t think of any other examples
> offhand that might have embedded images.)
>
