Hi Tom,
Thanks for the suggestion!
We've been using PSM 6 (Assume a single uniform block of text) and, for
that input image, it outputs nothing for both the stock chi_tra.traineddata
and our in-house trained data file.
However... I just tried PSM 13 ("Raw line. Treat the image as a single text
line, bypassing hacks that are Tesseract-specific") and do get some output!
With chi_tra: same as you got: 我 是 說 …
With in-house model: 我是說...」
The characters themselves are correct. The stock chi_tra model puts an
extra space after each character. I recall reading a bug report about that
somewhere.
The spacing with our model is better but it adds an extraneous closing
square-quote. Another difference (not so significant) is that the stock
model outputs an ellipsis character while the in-house model outputs three
periods.
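As a stopgap, the three differences above (extra spaces, the stray closing
quote, and three periods instead of an ellipsis) could be smoothed over after
OCR. This is only a rough sketch of a workaround, not a fix for the underlying
problem. The function name and the character ranges are my own assumptions,
and the ranges are not exhaustive:

```python
import re

# Assumed, non-exhaustive set: CJK punctuation, CJK Unified Ideographs,
# full-width forms, and the ellipsis character.
_CJK = r"\u3000-\u303f\u4e00-\u9fff\uff00-\uffef\u2026"


def normalize_subtitle(text: str) -> str:
    """Hypothetical post-processing for one OCR'd subtitle string."""
    # Drop spaces or tabs that sit between two CJK characters.
    text = re.sub(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])", "", text)
    # Map three ASCII periods to the single ellipsis character.
    text = text.replace("...", "…")
    text = text.strip()
    # Remove one trailing closing corner quote if it has no matching opener.
    if text.endswith("」") and text.count("」") > text.count("「"):
        text = text[:-1]
    return text


print(normalize_subtitle("我 是 說 …"))
print(normalize_subtitle("我是說...」"))
```

One caveat: a closing quote with no opener can be legitimate when the quote
was opened in an earlier subtitle, so that last step may be too aggressive for
real dialogue.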
However, once in a while the subtitle image has two lines of text, which is
why we chose PSM 6.
[image: multiline_sub_16.png]
I tried the image above with PSM 13 and unfortunately it failed with both
the stock chi_tra and our in-house model: m 論
Using PSM 6 works (but again chi_tra adds the extra spaces; our in-house
model is better).
So, I'm thinking the issue is with the preprocessing, segmentation, and
glyph identification more than the model itself.
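Since PSM 6 handles the two-line images and PSM 13 rescues the single line
that PSM 6 returns empty, one option is to try PSM 6 first and fall back to
PSM 13 only when the output is empty. A minimal sketch, assuming the
tesseract binary is on the PATH; the function names and the injectable
runner argument are hypothetical, added only so the logic can be exercised
without tesseract installed:

```python
import subprocess
from typing import Callable, Optional, Sequence, Tuple


def run_tesseract(image: str, lang: str, psm: int) -> str:
    """Run the tesseract CLI; "-" as the output base sends text to stdout."""
    result = subprocess.run(
        ["tesseract", image, "-", "-l", lang, "--psm", str(psm)],
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=False,
    )
    return result.stdout.strip()


def ocr_with_fallback(
    image: str,
    lang: str,
    psms: Sequence[int] = (6, 13),
    runner: Callable[[str, str, int], str] = run_tesseract,
) -> Tuple[Optional[int], str]:
    """Try each PSM in order; return the first mode that yields any text."""
    for psm in psms:
        text = runner(image, lang, psm)
        if text:
            return psm, text
    # Every mode produced empty output.
    return None, ""
```

This only catches the empty-output case. It would not detect a wrong but
non-empty result such as the "m 論" that PSM 13 gave for the two-line image,
which is why PSM 6 goes first in the list.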
On Tuesday, August 6, 2024 at 2:09:32 AM UTC+8 [email protected] wrote:
> On Tuesday, July 30, 2024 at 8:23:38 AM UTC-4 Danny wrote:
>
> I have a problem where tesseract produces no output (zero byte output
> file) when presented with Chinese characters followed by either an ellipsis
> or three periods.
>
> [image: bad_sub_243.png]
>
> If I crop the image in photoshop to remove the dots, the three Chinese
> characters are recognized perfectly. Feeding the image above, or feeding
> just the three dots, produces no output.
>
> I've just recompiled with the latest GIT version (see below). I've also
> re-trained the chi_tra model several times and added many words with the
> three dots to the wordlist. The result is the same with both.
>
> Any suggestions?
>
> *Command*
> tesseract bad_sub_243.png output -l tqChiTra --loglevel TRACE -c
> edges_debug=1 -c ambigs_debug_level=10 -c classify_debug_level=10 -c
> dawg_debug_level=3 -c wordrec_debug_blamer=1 -c tessedit_dump_choices=1
> -c tessedit_debug_block_rejection=1 -c textord_noise_debug=1 -c
> applybox_debug=10
>
>
> What page segmentation mode are you using? If you're using the default of
> full automatic page segmentation (designed for pages of uniform text), it's
> unlikely to work very well for closed captioning texts (a detail not
> mentioned here, but included later in the thread).
>
> My test with the standard traditional Chinese model from tessdata gave
> this result:
>
> tesseract image.png - -l chi_tra --psm 13
> 我 是 說 …
>
> I don't read Chinese, so there may be some subtle differences in the
> characters, but they look pretty close to my eye.
>
> Tom
>
>