Have you tried running this through tesseract with several language models
combined, e.g. -l chi_tra+eng ?
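
A minimal Python sketch of what I mean, in case it helps. The helper name
build_tesseract_cmd is made up for illustration. tqChiTra is the custom model
from your command line, and eng assumes the stock English traineddata sits in
the same tessdata directory. The only tesseract feature this relies on is that
-l accepts several models joined with '+'.

```python
import os
import shutil
import subprocess


def build_tesseract_cmd(image, out_base, langs):
    """Build a tesseract command line that loads several language models.

    Tesseract takes the models as one '+'-joined value for -l, e.g. chi_tra+eng.
    """
    if not langs:
        raise ValueError("need at least one language model")
    return ["tesseract", image, out_base, "-l", "+".join(langs)]


# Assumptions: tqChiTra is the custom model from the original post and
# eng is the stock English model; both must be present in tessdata.
cmd = build_tesseract_cmd("bad_sub_243.png", "output", ["tqChiTra", "eng"])
print(" ".join(cmd))

# Only run when a tesseract binary and the input image are actually available.
if shutil.which("tesseract") and os.path.exists(cmd[1]):
    subprocess.run(cmd, check=False)
```

The order of the models may matter for the result, so it is probably worth
trying eng+tqChiTra as well.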

The idea behind this question: using dots as periods (and runs of dots
serving as an ellipsis) is mostly particular to European languages, while
Chinese writing uses other means to signal the end of a sentence (for
example, you will often see a small circle, 。, serving as the period).

I suspect two things are at play here.

(1) We don't know the training details for the models Ray Smith produced at
Google and subsequently published, but my bet is that the period 'dot' and
'...' ellipsis symbols did not feature heavily in the Chinese training set
(possibly not at all, though that can be checked by inspecting the charset
that's defined as part of the training data file).

(2) I do see the tesseract-internal preprocessing stages (binarization,
noise reduction, ...) having a hard time with noisy images that look 'clean'
to the human eye: JPEG input images and the like, e.g. camera and video
screengrabs, which (nearly) always have traveled through some (hidden)
MPEG/JPEG/similar lossy compression stage. My own tests indicate that the
preprocessor may well have detected the ellipsis and included it as part of
the line image, but that the subsequent OCR recognition stage MAY have
dropped the dots because their ratings turned out too low.

Meanwhile, English, Latin, etc. have a much better chance of observing
periods and ranking them as highly probable 'period' characters, as these
symbols must have featured more heavily in their training sets by necessity.
So it may be useful to run English, Latin or a similar European language
model as a secondary language, in order to give tesseract some higher
rankings for those dot pixels to work with...
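
To make the charset check concrete, here is a rough Python sketch. It assumes
you have already unpacked the traineddata (combine_tessdata -u should do that)
and are looking at the text of the resulting unicharset file. It also assumes
the usual layout: the first line holds the entry count, and every following
line starts with the symbol itself, followed by space-separated properties.
The function names and the sample data are made up for illustration; the
sample is not taken from any real model.

```python
def unicharset_symbols(text):
    """Return the set of symbols listed in the text of a unicharset file.

    Assumed layout: line 1 is the entry count, each later line begins with
    the symbol, then a space, then its properties.
    """
    symbols = set()
    for line in text.splitlines()[1:]:
        symbol = line.split(" ", 1)[0]
        if symbol:
            symbols.add(symbol)
    return symbols


def missing_symbols(text, wanted):
    """List the wanted symbols that the unicharset does not contain."""
    present = unicharset_symbols(text)
    return [s for s in wanted if s not in present]


# Tiny made-up sample in the assumed layout (NOT from a real model).
sample = "4\nNULL 0 Common 0\n中 1 Han 1\n文 1 Han 2\n。 10 Common 3\n"

# Check for the ASCII period, the single-character ellipsis and the CJK period.
print(missing_symbols(sample, [".", "\u2026", "。"]))
```

If the period or the ellipsis comes back as missing for the Chinese model,
the recognizer cannot output it at all, no matter what the wordlist says.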





(More on tesseract and image noise + text bounding boxes in the next couple
of weeks; I'm still trying to organize that research, as it kinda exploded
in my face: instead of one issue there are several, and none of them is easy
to fix or work around.)


On Fri, 2 Aug 2024, 11:13 'Danny' via tesseract-ocr, <
[email protected]> wrote:

> Can anyone suggest some debug settings I can activate to try to trace
> down why I'm getting no output?
> Thanks
> Danny
>
> On Tuesday, July 30, 2024 at 8:23:38 PM UTC+8 Danny wrote:
>
>> I have a problem where tesseract produces no output (zero byte output
>> file) when presented with Chinese characters followed by either an ellipsis
>> or three periods.
>>
>> [image: bad_sub_243.png]
>>
>> If I crop the image in Photoshop to remove the dots, the three Chinese
>> characters are recognized perfectly. Feeding the image above, or feeding
>> just the three dots, produces no output.
>>
>> I've just recompiled with the latest Git version (see below). I've also
>> re-trained the chi_tra model several times and added many words with the
>> three dots to the wordlist. The result is the same with both.
>>
>> Any suggestions?
>>
>> *Command*
>> tesseract bad_sub_243.png  output -l tqChiTra --loglevel TRACE   -c
>> edges_debug=1   -c ambigs_debug_level=10   -c classify_debug_level=10   -c
>> dawg_debug_level=3   -c wordrec_debug_blamer=1   -c tessedit_dump_choices=1
>>   -c tessedit_debug_block_rejection=1   -c textord_noise_debug=1   -c
>> applybox_debug=10
>>
>> *Messages*
>> Warning: Parameter not found: language_model_ngram_on
>> Warning: Parameter not found: segsearch_max_char_wh_ratio
>> Warning: Parameter not found:
>> language_model_ngram_space_delimited_language
>> Warning: Parameter not found: language_model_use_sigmoidal_certainty
>> Warning: Parameter not found: language_model_ngram_nonmatch_score
>> Warning: Parameter not found: classify_integer_matcher_multiplier
>> Warning: Parameter not found: assume_fixed_pitch_char_segment
>> Warning: Parameter not found: allow_blob_division
>> Warning: Parameter not found: segsearch_max_char_wh_ratio
>> Warning: Parameter not found:
>> language_model_ngram_space_delimited_language
>> Warning: Parameter not found: language_model_use_sigmoidal_certainty
>> Warning: Parameter not found: language_model_ngram_nonmatch_score
>> Warning: Parameter not found: classify_integer_matcher_multiplier
>> Warning: Parameter not found: assume_fixed_pitch_char_segment
>> Warning: Parameter not found: allow_blob_division
>> Estimating resolution as 675
>> Row ending at (221,23.6372): R=9999, dc=3, nc=0, REJECTED
>> cleanup_blocks: # rows = 0 / 1
>> cleanup_blocks: # blocks = 0 / 1
>> Estimating resolution as 675
>> Row ending at (221,23.6372): R=9999, dc=3, nc=0, REJECTED
>> cleanup_blocks: # rows = 0 / 1
>> cleanup_blocks: # blocks = 0 / 1
>>
>> *Version*
>> # tesseract --version
>> tesseract 5.4.1-11-g46b9
>>  leptonica-1.76.0
>>   libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.3) : libpng 1.6.34 :
>> libtiff 4.0.9 : zlib 1.2.11 : libwebp 1.0.0
>>  Found AVX
>>  Found SSE4.1
>>  Found OpenMP 201511
>>  Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6
>> liblz4/1.8.1
>>  Found libcurl/7.61.1 OpenSSL/1.1.1c zlib/1.2.11 brotli/1.0.6
>> libidn2/2.2.0 libpsl/0.20.2 (+libidn2/2.0.5) libssh/0.9.0/openssl/zlib
>> nghttp2/1.33.0
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/11209fd7-65f6-49d1-8153-ae217db71e85n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/11209fd7-65f6-49d1-8153-ae217db71e85n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
