Re: [tesseract-ocr] OCR Output contains "xlz"

Tom Morris Mon, 16 Oct 2023 13:30:04 -0700

On Monday, October 16, 2023 at 3:34:39 AM UTC-4 Danny wrote:


This raises a new issue: the input data (TV subtitles) are a mixture of 1 
or 2 line text blocks. And a 1-line text block might be a single character 
in this case. 

So the ideal page segmentation mode should be 6, no? But looking at the 
debug output, it thinks there are two characters in the input image...


It's not terribly surprising that "page" segmentation gets confused by a 
single character, although I'm a little surprised that it came up with 
overlapping bounding boxes. 

Since the TV image capture is presumably fixed resolution and it sounds 
like you've only got a single font to deal with, it seems like you can tell 
based on the image bounds whether you've got a single line (PSM 7) or more 
than one line (PSM 6).
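
A rough sketch of that idea, in case it helps. The height threshold and the 
function names are made up for illustration; you'd have to measure the real 
cutoff from your own captures. It only builds the tesseract command line 
(image, "stdout", and the --psm option) and doesn't run anything.

```python
# Hypothetical cutoff: the tallest cropped subtitle image (in pixels) that
# still contains only one line of text. Measure this from real captures.
SINGLE_LINE_MAX_HEIGHT_PX = 60


def choose_psm(image_height_px, single_line_max_height_px=SINGLE_LINE_MAX_HEIGHT_PX):
    """Pick a Tesseract page segmentation mode from the image height.

    Returns 7 (treat the image as a single text line) for short images,
    otherwise 6 (assume a single uniform block of text).
    """
    if image_height_px <= 0:
        raise ValueError("image height must be positive")
    return 7 if image_height_px <= single_line_max_height_px else 6


def tesseract_command(image_path, image_height_px):
    """Build (but do not run) a tesseract CLI invocation for this image."""
    psm = choose_psm(image_height_px)
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


if __name__ == "__main__":
    # Example: a 40 px tall crop is treated as a single line.
    print(" ".join(tesseract_command("subtitle.png", 40)))
```

Since the capture resolution and font are fixed, a single constant should be 
enough; if the crops vary, the cutoff would need to scale with them.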

It's been a long time since I looked at it, but closed captioning is 
usually encoded digitally in the signal in a sideband channel, so 
extracting it directly would be much simpler than running OCR on the image.

Tom 

