Hi. I tried the new Tesseract and the traineddata files for Japanese (both jpn.traineddata and Japanese.traineddata).
The recognition result with jpn.traineddata is very good. Japanese.traineddata also gives a good result, but unnecessary spaces are inserted between words or characters. Is this behavior expected? In Japanese, words are not separated by spaces. If this behavior is expected, what kind of usage is Japanese.traineddata intended for?

jpn.traineddata (very good, and what I expected):

--- start ---
$ tesseract -l jpn test_jpn_04.jpg stdout
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 168
OCR 機能を提供する Web API はいくつか存在しますが、用途によってカスタマイズすることが
できません。Tesseract は多数の言語に対応し、Linux、macOS、Windows で動作します。
--- end ---

Japanese.traineddata:

--- start ---
$ tesseract -l Japanese test_jpn_04.jpg stdout
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 168
OCR 機能 を 提供 する Web API は いく つか 存在 し ます が 、 用 途 に よっ て カス タマ イズ する こと が で きま せん 。Tesseract は 多数 の 言語 に 対応 し 、Linux、macOS、Windows で 動作 し ます 。
--- end ---

The result is the same on Ubuntu (beta.1) and macOS (4.0.0-beta.2-586-g607e).

Thanks.
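P.S. As a stopgap, the extra spaces can probably be stripped in post-processing. The following is only a minimal Python sketch, not something Tesseract provides: the function name strip_japanese_spaces is hypothetical, and the Unicode ranges (hiragana, katakana, the basic CJK ideograph block, and CJK punctuation) are an assumption about what counts as "Japanese" here. It removes a run of spaces only when both neighbors are Japanese characters, so spaces around Latin words such as "Web API" are kept.

```python
import re

# Assumed character ranges: CJK punctuation (、。 etc.), hiragana,
# katakana, and the basic CJK unified ideographs block.
_JA = r"\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff"

# A run of ASCII spaces with a Japanese character on both sides.
_SPACE_BETWEEN_JA = re.compile(rf"(?<=[{_JA}]) +(?=[{_JA}])")


def strip_japanese_spaces(text: str) -> str:
    """Remove spaces that sit between two Japanese characters.

    Hypothetical helper for cleaning OCR output; spaces next to
    Latin letters or digits are left untouched.
    """
    return _SPACE_BETWEEN_JA.sub("", text)


if __name__ == "__main__":
    sample = "OCR 機能 を 提供 する Web API は いく つか 存在 し ます が 、 用 途 に"
    print(strip_japanese_spaces(sample))
```

Applied to the Japanese.traineddata output above, this should give the same text as the jpn.traineddata output, apart from the line break. It would also remove any space that is genuinely intended between two Japanese characters, so it is a workaround rather than a fix.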

