I used chi_tra because the document is in Traditional Chinese.

On Sunday, January 7, 2018 at 2:55:13 PM UTC+8, shree wrote:
>
> Have you tried with chi_sim which has both Chinese and English?
>
> On 07-Jan-2018 12:03 PM, "林博仁" <[email protected]> wrote:
>
>> I am unable to extract a document with Chinese characters properly, please
>> help.
>> Input File
>>
>> https://drive.google.com/file/d/16j21iuXVwrxplGGtJZhxeTf0ziXPYD0d/view?usp=sharing
>> It's a scanned TIFF at 600 DPI, generated using EPSON's iscan
>> utility
>> Tesseract Commandline
>> tesseract input.tiff output.txt -l chi_tra+eng
>> Expected Behavior
>> Chinese characters are recognized in output.txt
>> Current Behavior
>> Only English characters are recognized, Chinese characters are missing
>> and replaced with blanks, like this <https://paste2.org/gzkPkY1D>
>>
>> Tesseract Source
>> ```````````````````
>> commit 000d027a9f40e17c9a90a907fa9e4a16616e61a0
>> Author: Egor Pugin <[email protected]>
>> Date: Fri Jan 5 18:51:35 2018 +0300
>>
>> Rename tesseract library.
>> ```````````````````
>>
>> `````````````````
>> tesseract 4.00.00alpha
>> leptonica-1.74.4
>> libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 :
>> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
>>
>> Found AVX
>> Found SSE
>> `````````````````
>> Operating System
>> KDE neon 5.11 (based on Ubuntu 16.04)
>>
>> tessdata
>> commit f1d1268
>> <https://github.com/tesseract-ocr/tessdata_best/commit/f1d12682c0f1afe61db892f4b2bfaa7909ad7a59>
>> from tessdata_best
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/02afd731-a74a-417a-a42c-6ee4f41b3a63%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/02afd731-a74a-417a-a42c-6ee4f41b3a63%40googlegroups.com?utm_medium=email&utm_source=footer>.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
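For anyone reproducing the command quoted above, here is a minimal Python sketch of two things that may be worth checking. It is not a diagnosis of the missing characters. The helper names `build_tesseract_command` and `missing_traineddata` are hypothetical and not part of Tesseract. The sketch assumes the language models are stored as `<lang>.traineddata` files in a single tessdata directory, and it passes `output` rather than `output.txt` as the output base because Tesseract appends `.txt` itself.

```python
from pathlib import Path


def build_tesseract_command(image, output_base, langs, tessdata_dir=None):
    """Build the argv list for a Tesseract CLI run (hypothetical helper).

    Languages are joined with '+', as in `-l chi_tra+eng`.
    """
    cmd = ["tesseract", str(image), str(output_base), "-l", "+".join(langs)]
    if tessdata_dir is not None:
        # Point Tesseract at a specific tessdata checkout, e.g. tessdata_best.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def missing_traineddata(tessdata_dir, langs):
    """Return the languages whose <lang>.traineddata file is absent.

    Assumption: models sit directly inside tessdata_dir.
    """
    root = Path(tessdata_dir)
    return [lang for lang in langs if not (root / f"{lang}.traineddata").is_file()]


if __name__ == "__main__":
    langs = ["chi_tra", "eng"]
    # Tesseract would write the recognized text to output.txt.
    print(" ".join(build_tesseract_command("input.tiff", "output", langs)))
```

The resulting list can be handed to `subprocess.run` once `missing_traineddata` returns an empty list for the tessdata directory in use. Running `tesseract --list-langs` is another way to see which models Tesseract actually finds.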

