On Monday, October 16, 2023 at 3:34:39 AM UTC-4 Danny wrote:
> This raises a new issue: the input data (TV subtitles) are a mixture of 1- or 2-line text blocks. And a 1-line text block might be a single character in this case. So the ideal page segmentation mode should be 6, no? But looking at the debug output, it thinks there are two characters in the input image...

It's not terribly surprising that "page" segmentation gets confused by a single character, although I'm a little surprised that it came up with overlapping bounding boxes.

Since the TV image capture is presumably fixed resolution and it sounds like you've only got a single font to deal with, it seems like you can tell based on the image bounds whether you've got a single line (PSM 7) or more than one line (PSM 6).

It's been a long time since I looked at it, but closed captioning is usually encoded in the signal digitally in a side band channel, which would be a much simpler way to extract it.

Tom
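
For illustration, the idea of choosing the PSM from the image bounds might look roughly like the sketch below. The height threshold and both function names are hypothetical, not something from this thread or from Tesseract itself; the threshold would have to be measured from real captures of one-line and two-line subtitle crops. The only Tesseract-specific part is the standard `--psm` command-line option.

```python
# Hypothetical threshold: the tallest pixel height a one-line subtitle crop
# is expected to have. This number is an assumption; measure your own crops.
SINGLE_LINE_MAX_HEIGHT = 60


def choose_psm(crop_height, single_line_max_height=SINGLE_LINE_MAX_HEIGHT):
    """Return PSM 7 (single text line) for short crops, else PSM 6 (uniform block)."""
    if crop_height <= 0:
        raise ValueError("crop_height must be positive")
    return 7 if crop_height <= single_line_max_height else 6


def tesseract_command(image_path, crop_height):
    """Build a tesseract CLI argument list that prints recognized text to stdout."""
    psm = choose_psm(crop_height)
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


if __name__ == "__main__":
    # A short crop is treated as one line, a tall crop as a block of lines.
    print(tesseract_command("subtitle_short.png", 40))
    print(tesseract_command("subtitle_tall.png", 120))
```

The resulting list can be passed to `subprocess.run` if the `tesseract` binary is on the path. This only works if the crop is tight around the subtitle text, so that its height really reflects the number of lines.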

