Thanks for the information, Merlijn. I'll take a look at some of the links you posted.
On Wednesday, April 27, 2022 at 10:18:39 AM UTC-7 Merlijn Wajer wrote:
> Hi,
>
> On 27/04/2022 19:07, Brad wrote:
> > For V5.10.0 of Tesseract, one of the changes is:
> > (correction: version 5.1.0)
> >
> >> Handle image and line separator regions in ALTO, hOCR and text output
> > formats.
> >
> > I'm curious about what this means. Can Tesseract be used to identify
> > rectangles and such on an image that might surround a text region, and if
> > so, is this what this is referring to? Are there any examples showing how
> > this works?
>
> Here is the commit in question:
>
> https://github.com/tesseract-ocr/tesseract/commit/424b17f997363670d187f42c43408c472fe55053
>
> (for some background see
> https://github.com/tesseract-ocr/tesseract/pull/3710)
>
> The output added to say hOCR is "ocr_photo" and "ocr_separator". You can
> see how the results are iterated over in the source if you would like to
> use that yourself.
>
> My/our immediate use case is detecting photos on pages of books and
> articles, which will be emitted as ocr_photo when outputting hOCR.
>
> I don't know if this can help in your specific use case, but if you're
> interested in finding images, it will help for sure. I cannot really
> comment on the ocr_separator parts so much.
>
> Regards,
> Merlijn
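
For anyone who wants to pick these regions out of the hOCR output, here is a minimal sketch in Python using only the standard library. The class names "ocr_photo" and "ocr_separator" come from the message above. Everything else is an assumption for illustration: the helper names (RegionFinder, find_regions) are hypothetical, the sample markup is hand-written rather than real Tesseract output, and the code assumes each region carries its box in the usual hOCR `title="bbox x0 y0 x1 y1"` form. It matches on the class attribute only, so it does not depend on which HTML tag Tesseract actually emits.

```python
import re
from html.parser import HTMLParser

# Class names are taken from the thread; the rest is illustrative.
REGION_CLASSES = {"ocr_photo", "ocr_separator"}

# Assumed hOCR convention: title="bbox x0 y0 x1 y1" (possibly with more fields).
BBOX_RE = re.compile(r"bbox\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)")


class RegionFinder(HTMLParser):
    """Collect (class_name, bbox) pairs for photo and separator regions."""

    def __init__(self):
        super().__init__()
        self.regions = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        match = BBOX_RE.search(attr.get("title") or "")
        # bbox is None when the element has no parsable bounding box.
        bbox = tuple(int(v) for v in match.groups()) if match else None
        for name in sorted(classes & REGION_CLASSES):
            self.regions.append((name, bbox))


def find_regions(hocr_text):
    """Return a list of (class_name, bbox) in document order."""
    finder = RegionFinder()
    finder.feed(hocr_text)
    finder.close()
    return finder.regions


# Hand-written sample for demonstration only, not actual Tesseract output.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 800 1200'>
  <div class='ocr_photo' id='block_1_1' title="bbox 50 60 400 300"></div>
  <div class='ocr_separator' id='block_1_2' title="bbox 50 320 750 324"></div>
  <div class='ocr_carea' id='block_1_3' title="bbox 50 340 750 1100">text</div>
</div>
"""

if __name__ == "__main__":
    for name, bbox in find_regions(SAMPLE):
        print(name, bbox)
```

To try it on real output, an hOCR file can be produced with `tesseract page.png out hocr` (which writes out.hocr) using version 5.1.0 or later, and its contents passed to find_regions. Whether the photo and separator regions line up with the rectangles Brad is after would need checking on actual pages.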

