Thanks for the information, Merlijn. I'll take a look at some of the links 
you posted.

On Wednesday, April 27, 2022 at 10:18:39 AM UTC-7 Merlijn Wajer wrote:

> Hi,
>
> On 27/04/2022 19:07, Brad wrote:
> > For V5.10.0 of Tesseract, one of the changes is:
>
> (correction: version 5.1.0)
>
> >> Handle image and line separator regions in ALTO, hOCR and text output
> >> formats.
> > 
> > I'm curious about what this means. Can Tesseract be used to identify
> > rectangles and such on an image that might surround a text region, and if
> > so, is this what this is referring to? Are there any examples showing how
> > this works?
>
> Here is the commit in question: 
>
> https://github.com/tesseract-ocr/tesseract/commit/424b17f997363670d187f42c43408c472fe55053
>  
> (for some background see 
> https://github.com/tesseract-ocr/tesseract/pull/3710)
>
> The output added to, say, hOCR is "ocr_photo" and "ocr_separator". You can 
> see how the results are iterated over in the source if you would like to 
> use that yourself.
>
> My/our immediate use case is detecting photos on pages of books and 
> articles, which will be emitted as ocr_photo when outputting hOCR.
>
> I don't know if this can help in your specific use case, but if you're 
> interested in finding images, it will help for sure. I cannot really 
> comment on the ocr_separator parts so much.
>
> Regards,
> Merlijn
>
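
For anyone who wants to pick these regions out of the hOCR output, here is a
minimal sketch in Python using only the standard library. It assumes the
regions appear as elements whose class attribute contains "ocr_photo" or
"ocr_separator" and whose title attribute carries a "bbox x0 y0 x1 y1" entry,
as hOCR elements usually do. The names RegionCollector and find_regions are
made up for this example and are not part of Tesseract.

```python
import re
from html.parser import HTMLParser

# hOCR class names mentioned in the thread.
REGION_CLASSES = {"ocr_photo", "ocr_separator"}

# Assumption: the bounding box is stored in the title attribute
# as "bbox x0 y0 x1 y1" with integer pixel coordinates.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class RegionCollector(HTMLParser):
    """Hypothetical helper: collects (class_name, bbox) for image/separator regions."""

    def __init__(self):
        super().__init__()
        self.regions = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        match = BBOX_RE.search(attr.get("title") or "")
        bbox = tuple(int(v) for v in match.groups()) if match else None
        # Record one entry per matching class on this element.
        for name in sorted(classes & REGION_CLASSES):
            self.regions.append((name, bbox))


def find_regions(hocr_text):
    """Return a list of (class_name, bbox or None) in document order."""
    collector = RegionCollector()
    collector.feed(hocr_text)
    collector.close()
    return collector.regions
```

The hOCR file itself can be produced with something like
"tesseract page.png out hocr" (which writes out.hocr), and its contents passed
to find_regions() as a string. Whether a given page yields any ocr_photo or
ocr_separator elements depends on what the layout analysis detects.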
