I used chi_tra because the document is in Traditional Chinese.
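
For anyone who wants to compare the two language settings side by side, here is a minimal sketch (not part of the original report). The helper names `build_tesseract_cmd` and `run_comparison` and the output base names are hypothetical; only the `tesseract IMAGE OUTPUTBASE -l LANG` form comes from the command quoted below.

```python
import shutil
import subprocess


def build_tesseract_cmd(image, output_base, langs):
    """Return the argv list for `tesseract IMAGE OUTPUTBASE -l LANG1+LANG2`."""
    return ["tesseract", image, output_base, "-l", "+".join(langs)]


def run_comparison(image):
    """Run OCR once per language set; does nothing if tesseract is not installed."""
    if shutil.which("tesseract") is None:
        return []
    commands = [
        # Traditional Chinese plus English, as in the original report.
        build_tesseract_cmd(image, "out_chi_tra", ["chi_tra", "eng"]),
        # Simplified Chinese plus English, as suggested in the reply.
        build_tesseract_cmd(image, "out_chi_sim", ["chi_sim", "eng"]),
    ]
    for cmd in commands:
        # check=False: a failed run should not stop the second attempt.
        subprocess.run(cmd, check=False)
    return commands
```

Tesseract appends `.txt` to the output base, so `run_comparison("input.tiff")` would leave `out_chi_tra.txt` and `out_chi_sim.txt` to diff.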

On Sunday, January 7, 2018 at 2:55:13 PM UTC+8, shree wrote:
>
> Have you tried with chi_sim which has both chinese and english?
>
> On 07-Jan-2018 12:03 PM, "林博仁" <[email protected]> wrote:
>
>> I am unable to extract text from a document with Chinese characters 
>> properly; please help.
>> Input File
>>
>> https://drive.google.com/file/d/16j21iuXVwrxplGGtJZhxeTf0ziXPYD0d/view?usp=sharing
>> It's a TIFF scanned at 600 DPI, generated using EPSON's iscan utility.
>> Tesseract Commandline
>> tesseract input.tiff output.txt -l chi_tra+eng
>> Expected Behavior
>> Chinese characters are recognized in output.txt
>> Current Behavior
>> Only English characters are recognized; Chinese characters are missing 
>> and replaced with blanks, like this <https://paste2.org/gzkPkY1D>
>>
>> Tesseract Source
>> ```````````````````
>> commit 000d027a9f40e17c9a90a907fa9e4a16616e61a0
>> Author: Egor Pugin <[email protected]>
>> Date:   Fri Jan 5 18:51:35 2018 +0300
>>
>>     Rename tesseract library.
>> ```````````````````
>>
>> `````````````````
>> tesseract 4.00.00alpha
>>  leptonica-1.74.4
>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : 
>> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
>>
>>  Found AVX
>>  Found SSE
>> `````````````````
>> Operating System
>> KDE neon 5.11 (based on Ubuntu 16.04)
>>
>> tessdata
>> commit f1d1268 from tessdata_best
>> <https://github.com/tesseract-ocr/tessdata_best/commit/f1d12682c0f1afe61db892f4b2bfaa7909ad7a59>
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/02afd731-a74a-417a-a42c-6ee4f41b3a63%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>

