[tesseract-ocr] Re: No output when Chinese Traditional followed by dots or ellipsis

Tom Morris Mon, 05 Aug 2024 11:09:37 -0700

On Tuesday, July 30, 2024 at 8:23:38 AM UTC-4 Danny wrote:

I have a problem where tesseract produces no output (zero byte output file) 
when presented with Chinese characters followed by either an ellipsis or 
three periods.

[image: bad_sub_243.png]

If I crop the image in Photoshop to remove the dots, the three Chinese
characters are recognized perfectly. Feeding the image above, or feeding
just the three dots, produces no output.

I've just recompiled with the latest Git version (see below). I've also
re-trained the chi_tra model several times and added many words with the
three dots to the wordlist. The result is the same with both.

Any suggestions?

*Command*
tesseract bad_sub_243.png output -l tqChiTra --loglevel TRACE \
  -c edges_debug=1 -c ambigs_debug_level=10 -c classify_debug_level=10 \
  -c dawg_debug_level=3 -c wordrec_debug_blamer=1 -c tessedit_dump_choices=1 \
  -c tessedit_debug_block_rejection=1 -c textord_noise_debug=1 \
  -c applybox_debug=10

Which page segmentation mode are you using? The default, fully automatic
page segmentation, is designed for pages of uniform text, so it's unlikely
to work well on closed-caption text (a detail not mentioned here, but
included later in the thread).

My test with the standard traditional Chinese model from tessdata gave this
result:

tesseract image.png - -l chi_tra --psm 13
我 是 說 …

I don't read Chinese, so there may be some subtle differences in the
characters, but they look pretty close to my eye.
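
If it helps to compare modes side by side, here is a rough sketch of a
sweep over the page segmentation modes that seem most relevant to a
one-line caption: 6 (single uniform block), 7 (single text line), 8 (single
word) and 13 (raw line). The function name psm_sweep and the TESSERACT
variable are made up for illustration; it assumes the tesseract binary is
on your PATH and that the named traineddata is installed.

```shell
#!/bin/sh
# Sketch only: run one image through several page segmentation modes.
# TESSERACT is a hypothetical override hook; it defaults to the real binary.
TESSERACT="${TESSERACT:-tesseract}"

# psm_sweep IMAGE LANG
# Prints a header for each mode, followed by whatever text was recognized.
psm_sweep() {
    for psm in 6 7 8 13; do
        echo "== psm $psm =="
        # "-" as the output base sends the recognized text to stdout.
        "$TESSERACT" "$1" - -l "$2" --psm "$psm"
    done
}
```

Calling it as `psm_sweep bad_sub_243.png chi_tra` should show which modes,
if any, return text for the image with the dots. Swapping in your own
model name for chi_tra would show whether the behavior is specific to the
retrained model.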

Tom

