On Tuesday, July 30, 2024 at 8:23:38 AM UTC-4 Danny wrote:

> I have a problem where tesseract produces no output (zero byte output file) when presented with Chinese characters followed by either an ellipsis or three periods.
>
> [image: bad_sub_243.png]
>
> If I crop the image in photoshop to remove the dots, the three Chinese characters are recognized perfectly. Feeding the image above, or feeding just the three dots, produces no output.
>
> I've just recompiled with the latest GIT version (see below). I've also re-trained the chi_tra model several times and added many words with the three dots to the wordlist. The result is the same with both. Any suggestions?
>
> *Command*
> tesseract bad_sub_243.png output -l tqChiTra --loglevel TRACE -c edges_debug=1 -c ambigs_debug_level=10 -c classify_debug_level=10 -c dawg_debug_level=3 -c wordrec_debug_blamer=1 -c tessedit_dump_choices=1 -c tessedit_debug_block_rejection=1 -c textord_noise_debug=1 -c applybox_debug=10

What page segmentation mode are you using? If you're using the default of full automatic page segmentation (designed for pages of uniform text), it's unlikely to work very well for closed captioning texts (a detail not mentioned here, but included later in the thread).

My test with the standard traditional Chinese model from tessdata gave this result:

tesseract image.png - -l chi_tra --psm 13
我 是 說 …

I don't read Chinese, so there may be some subtle differences in the characters, but they look pretty close to my eye.

Tom
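To compare segmentation modes on the same image rather than trying them one at a time, something like the following might help. It is only a minimal sketch: the helper names `build_cmd` and `sweep` and the choice of modes to try are my own assumptions, and it simply shells out to the same `tesseract image - -l LANG --psm N` command line shown above.

```python
import shutil
import subprocess

# A few PSM values listed by `tesseract --help-psm`.
# Which ones to try is an assumption of this sketch, not a recommendation.
CANDIDATE_PSMS = {
    3: "fully automatic page segmentation (default)",
    6: "single uniform block of text",
    7: "single text line",
    13: "raw line, bypassing Tesseract-specific hacks",
}


def build_cmd(image, lang="chi_tra", psm=13):
    """Return the argv list for one Tesseract run that prints to stdout."""
    # "-" as the output base makes tesseract write the text to stdout.
    return ["tesseract", image, "-", "-l", lang, "--psm", str(psm)]


def sweep(image, lang="chi_tra"):
    """Run each candidate PSM on `image` and return {psm: recognized text}."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    results = {}
    for psm in CANDIDATE_PSMS:
        proc = subprocess.run(
            build_cmd(image, lang, psm),
            capture_output=True,
            text=True,
            encoding="utf-8",  # Chinese output must be decoded as UTF-8
            check=False,       # keep going even if one mode fails
        )
        results[psm] = proc.stdout.strip()
    return results
```

Calling `sweep("bad_sub_243.png", lang="tqChiTra")` would return one string per mode, so an empty result for mode 3 next to a non-empty one for mode 7 or 13 would point at page segmentation rather than at the model or the wordlist.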

