[tesseract-ocr] Re: No output when Chinese Traditional followed by dots or ellipsis

'Danny' via tesseract-ocr Mon, 05 Aug 2024 17:15:31 -0700

Hi Tom,

Thanks for the suggestion!

We've been using PSM 6 (Assume a single uniform block of text) and, for 
that input image, it outputs nothing for both the stock chi_tra.traineddata 
and our in-house trained data file.

However... I just tried PSM 13 ("Raw line. Treat the image as a single text 
line, bypassing hacks that are Tesseract-specific") and do get some output!

With chi_tra: same as you got: 我 是 說 …
With in-house model: 我是說...」

The characters themselves are correct.  The stock chi_tra model puts an 
extra space after each character.  I recall reading a bug report about that 
somewhere.

The spacing with our model is better but it adds an extraneous closing 
square-quote.  Another difference (not so significant) is that the stock 
model outputs an ellipsis character while the in-house model outputs three 
periods.
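
Both cosmetic differences could probably be smoothed over after OCR rather 
than in the model. A minimal sketch, assuming the output is post-processed 
in Python; normalize_subtitle and the character ranges are my own choices, 
not anything Tesseract provides, and it deliberately leaves the stray 
closing quote alone:

```python
import re

# Assumed ranges: CJK punctuation, CJK Unified Ideographs, fullwidth forms,
# plus the ellipsis character. Adjust if the subtitles use other blocks.
_CJK = r"\u3000-\u303f\u4e00-\u9fff\uff00-\uffef\u2026"
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def normalize_subtitle(text):
    """Make output from the two models comparable.

    1. Turn three consecutive periods into a single ellipsis character.
    2. Drop spaces or tabs that sit between two CJK characters.
    """
    text = re.sub(r"\.{3}", "\u2026", text)
    return _SPACE_BETWEEN_CJK.sub("", text)


if __name__ == "__main__":
    print(normalize_subtitle("我 是 說 …"))
    print(normalize_subtitle("我是說...」"))
```

Spaces next to Latin text are kept, so mixed-language subtitles should not 
get glued together.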

However, once in a while the subtitle image has two lines of text, which is 
why we chose PSM 6.  

[image: multiline_sub_16.png]

I tried the image above with PSM 13 and unfortunately it failed with both 
the stock chi_tra and our in-house model: m 論
Using PSM 6 works (but again chi_tra adds the extra spaces; our in-house 
model is better).
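
One workaround I may try is to keep PSM 6 as the default and only fall back 
to PSM 13 when PSM 6 returns nothing. A rough sketch, assuming the tesseract 
binary is on PATH and is called the same way as in Tom's test; the function 
names and the runner parameter are my own, for illustration only:

```python
import subprocess


def run_tesseract(image_path, lang, psm):
    """Run the tesseract CLI, writing to stdout ("-"), and return the text."""
    result = subprocess.run(
        ["tesseract", image_path, "-", "-l", lang, "--psm", str(psm)],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def ocr_with_fallback(image_path, lang="chi_tra", psms=(6, 13),
                      runner=run_tesseract):
    """Try each PSM in order and return (psm, text) for the first hit.

    Returns (None, "") if every mode produces empty output.
    """
    for psm in psms:
        text = runner(image_path, lang, psm)
        if text:
            return psm, text
    return None, ""
```

Because PSM 13 is only used when PSM 6 comes back empty, the two-line 
subtitles that PSM 6 already handles never reach it. This does not explain 
why PSM 6 drops the line with the dots, it only works around it.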

So, I'm thinking the issue is with the preprocessing, segmentation, and 
glyph identification more than the model itself.  

On Tuesday, August 6, 2024 at 2:09:32 AM UTC+8 [email protected] wrote:

> On Tuesday, July 30, 2024 at 8:23:38 AM UTC-4 Danny wrote:
>
> I have a problem where tesseract produces no output (zero byte output 
> file) when presented with Chinese characters followed by either an ellipsis 
> or three periods.
>
> [image: bad_sub_243.png]
>
> If I crop the image in photoshop to remove the dots, the three Chinese 
> characters are recognized perfectly. Feeding the image above, or feeding 
> just the three dots, produces no output.
>
> I've just recompiled with the latest GIT version (see below).  I've also 
> re-trained the chi_tra model several times and added many words with the 
> three dots to the wordlist. The result is the same with both.
>
> Any suggestions?
>
> *Command*
> tesseract bad_sub_243.png  output -l tqChiTra --loglevel TRACE   -c 
> edges_debug=1   -c ambigs_debug_level=10   -c classify_debug_level=10   -c 
> dawg_debug_level=3   -c wordrec_debug_blamer=1   -c tessedit_dump_choices=1 
>   -c tessedit_debug_block_rejection=1   -c textord_noise_debug=1   -c 
> applybox_debug=10
>
>
> What page segmentation mode are you using? If you're using the default of 
> full automatic page segmentation (designed for pages of uniform text), it's 
> unlikely to work very well for closed captioning texts (a detail not 
> mentioned here, but included later in the thread).
>
> My test with the standard traditional Chinese model from tessdata gave 
> this result:
>
> tesseract image.png - -l chi_tra --psm 13
> 我 是 說 …
>
> I don't read Chinese, so there may be some subtle differences in the 
> characters, but they look pretty close to my eye.
>
> Tom
>  
>
