Yes, that's the idea. I've seen some PDFs, and I am not making this up, where the pages alternated between image-only and text.
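
To make the page-by-page idea concrete, here is a rough sketch in Python. It is not Tika's code and does not reproduce the actual criteria in AbstractPDF2XHTML; the function names and the character cutoff are made up purely for illustration. It only shows that each page gets its own decision, so a PDF with alternating image-only and text pages ends up with alternating choices.

```python
from typing import List

# Hypothetical cutoff, chosen only for illustration; not Tika's actual criteria.
MIN_CHARS_FOR_TEXT = 10


def choose_strategy(page_text: str, min_chars: int = MIN_CHARS_FOR_TEXT) -> str:
    """Return "ocr" for a page with little extractable text, else "text"."""
    # Ignore whitespace so a page holding only blanks counts as empty.
    visible = "".join(page_text.split())
    return "text" if len(visible) >= min_chars else "ocr"


def plan_document(pages: List[str]) -> List[str]:
    """Decide for every page independently; no page affects another."""
    return [choose_strategy(page) for page in pages]


if __name__ == "__main__":
    # Alternating searchable and image-only pages (image-only pages yield no text).
    pages = ["A full page of searchable text.", "", "Another searchable page.", " "]
    print(plan_document(pages))
```

The point of the design is that the decision is local to a page: an image-only page in the middle of an otherwise searchable document is still OCRed, and the searchable pages around it are still extracted as text.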
On Mon, Jan 11, 2021 at 9:41 AM Peter Kronenberg <[email protected]> wrote:
> Can you check my understanding of OCR_STRATEGY=AUTO? Looking at the code
> in AbstractPDF2XHTML, it appears to be done on a page by page basis. So if
> the page satisfies the criteria of having a small amount of text, then the
> entire page is OCRed. If the page is mostly searchable text, however, then
> the text will be extracted. Is this correct? Each page is processed
> independently?
>
