On Tuesday, July 30, 2024 at 8:23:38 AM UTC-4 Danny wrote:

I have a problem where tesseract produces no output (a zero-byte output 
file) when presented with Chinese characters followed by either an ellipsis 
or three periods.

[image: bad_sub_243.png]

If I crop the image in Photoshop to remove the dots, the three Chinese 
characters are recognized perfectly. Feeding the image above, or feeding 
just the three dots, produces no output.

I've just recompiled with the latest Git version (see below). I've also 
re-trained the chi_tra model several times and added many words with the 
three dots to the wordlist. The result is the same with both.

Any suggestions?

*Command*
tesseract bad_sub_243.png output -l tqChiTra --loglevel TRACE \
  -c edges_debug=1 \
  -c ambigs_debug_level=10 \
  -c classify_debug_level=10 \
  -c dawg_debug_level=3 \
  -c wordrec_debug_blamer=1 \
  -c tessedit_dump_choices=1 \
  -c tessedit_debug_block_rejection=1 \
  -c textord_noise_debug=1 \
  -c applybox_debug=10


What page segmentation mode are you using? If you're using the default, 
fully automatic page segmentation (--psm 3, designed for pages of uniform 
text), it's unlikely to work well for closed-caption text (a detail not 
mentioned here, but included later in the thread).

My test with the standard traditional Chinese model from tessdata gave this 
result:

tesseract image.png - -l chi_tra --psm 13
我 是 說 …

I don't read Chinese, so there may be some subtle differences in the 
characters, but they look pretty close to my eye.
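
To compare how the segmentation mode affects a short caption line like this one, a small sweep over the single-line and single-block modes may help. This is only a sketch: the function name psm_sweep and the TESSERACT_BIN override are hypothetical helpers made up for this example (neither is part of Tesseract), and the list of modes (6, 7, 8, 13) is just a guess at which ones are worth trying for subtitles.

```shell
#!/bin/sh
# Run one image through several page segmentation modes and print each result.
# psm_sweep and TESSERACT_BIN are hypothetical names for this sketch only;
# TESSERACT_BIN defaults to the tesseract binary found on PATH.
psm_sweep() {
    image=$1
    lang=${2:-chi_tra}
    # 6 = single uniform block, 7 = single text line,
    # 8 = single word, 13 = raw line
    for psm in 6 7 8 13; do
        printf '== psm %s ==\n' "$psm"
        # "-" as the output base sends the recognized text to stdout
        "${TESSERACT_BIN:-tesseract}" "$image" - -l "$lang" --psm "$psm" 2>/dev/null \
            || printf '(tesseract failed for psm %s)\n' "$psm"
    done
}

# Usage: psm_sweep bad_sub_243.png chi_tra
```

An empty section under a given mode would suggest the line is being dropped at the segmentation stage for that mode rather than by the model itself.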

Tom
 
