[tesseract-ocr] Unnecessary extra space with Japanese.traineddata

Atsuyoshi Suzuki Mon, 23 Jul 2018 20:31:53 -0700

Hi.

I tried the new Tesseract with the traineddata files for Japanese (both
jpn.traineddata and Japanese.traineddata).


The recognition result with jpn.traineddata is very good.

Japanese.traineddata also gives a good result, but unnecessary spaces are
inserted between words or characters.



Is this behavior expected? Japanese is written without spaces between
words.

If this behavior is expected, what usage is Japanese.traineddata intended
for?



jpn.traineddata (very good, and what I expected):

--- start ---
$ tesseract -l jpn  test_jpn_04.jpg stdout
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 168
OCR 機能を提供する Web API はいくつか存在しますが、用途によってカスタマイズすることが
できません。Tesseract は多数の言語に対応し、Linux、macOS、Windows で動作します。

--- end ---


Japanese.traineddata:

--- start ---
$ tesseract -l Japanese  test_jpn_04.jpg stdout
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 168
OCR 機能 を 提供 する Web API は いく つか 存在 し ます が 、 用 途 に よっ て カス タマ イズ する こと が
で きま せん 。Tesseract は 多数 の 言語 に 対応 し 、Linux、macOS、Windows で 動作 し ます 。

--- end ---


The result is the same on Ubuntu (beta.1) and macOS
(4.0.0-beta.2-586-g607e).
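
As a stopgap, the extra spaces could be removed in post-processing. Below is a minimal Python sketch, assuming the text is plain UTF-8 output like the above. The function name strip_japanese_spaces and the chosen character ranges are illustrative and not part of Tesseract. It removes ASCII spaces only when both neighbours are Japanese characters or CJK punctuation, so the spaces around Latin words such as "Web API" are kept.

```python
import re

# Illustrative ranges (an assumption, not an exhaustive definition of Japanese):
# CJK punctuation, hiragana, katakana, CJK unified ideographs, full-width forms.
_JA = r"\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff\uff00-\uffef"

# One or more ASCII spaces with a Japanese character on both sides.
_BETWEEN_JA = re.compile(rf"(?<=[{_JA}]) +(?=[{_JA}])")


def strip_japanese_spaces(text: str) -> str:
    """Remove spaces that sit between two Japanese characters."""
    return _BETWEEN_JA.sub("", text)


if __name__ == "__main__":
    line = "で きま せん 。Tesseract は 多数 の 言語 に 対応 し 、Linux、macOS、Windows で 動作 し ます 。"
    print(strip_japanese_spaces(line))
```

Applied to the Japanese.traineddata output above, this yields the same text as the jpn.traineddata output for this sample. It is only a workaround and does not explain why the model inserts the spaces.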



Thanks.
