Hi Tom,

I was hoping not to introduce heuristics before scanning the images, but it
sounds like the page segmentation in tesseract is not smart enough.
So from what you say, if the input image is:

a) "square-ish": PSM 10 Single Character
b) approx. one character height tall in the given font: PSM 7 Single Line
c) approx. N x character height tall: PSM 6 Uniform Block
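
Roughly, I picture the check looking like the Python sketch below. It takes
the mode numbers from your message (7 for a single line, 6 for a block) plus
10 for a single character. The function name choose_psm, the 40-pixel
character height and the slack tolerance are placeholders I made up for
illustration; none of them come from tesseract itself.

```python
def choose_psm(width, height, char_height, slack=0.5):
    """Pick a tesseract page segmentation mode from the image bounds.

    width, height -- size of the cropped caption image in pixels
    char_height   -- assumed height of one text line in the known font
    slack         -- assumed tolerance, as a fraction, for "approximately"
    """
    lines = height / char_height
    if lines >= 1 + slack:
        # Clearly taller than one line of text: uniform block.
        return 6
    if width <= height * (1 + slack):
        # About one line tall and roughly square: single character.
        return 10
    # About one line tall and wide: single line.
    return 7


if __name__ == "__main__":
    # Hypothetical crops, assuming a 40-pixel character height.
    for size in [(40, 48), (400, 48), (400, 130)]:
        psm = choose_psm(size[0], size[1], char_height=40)
        print(size, "--psm", psm)
```

The result would then be passed to tesseract through its --psm option. The
thresholds would need tuning against real captures.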

For your reference, closed captions used in the US, Canada, and Korea are
text-based. DVB subtitles, used in the rest of the world, are bitmap images.

Danny

> On 17 Oct 2023, at 04:29, Tom Morris <[email protected]> wrote:
> 
> It's not terribly surprising that "page" segmentation gets confused by a 
> single character, although I'm a little surprised that it came up with 
> overlapping bounding boxes. 
> 
> Since the TV image capture is presumably fixed resolution and it sounds like 
> you've only got a single font to deal with, it seems like you can tell based 
> on the image bounds whether you've got a single line (PSM 7) or more than one 
> line (PSM 6).
> 
> It's been a long time since I looked at it, but closed captioning is usually 
> encoded in the signal digitally in a side band channel, which would be a much 
> simpler way to extract it.
> 

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/6727242B-34CA-4234-BF1E-746510705817%40mac.com.
