Re: [tesseract-ocr] run text2image failed ,text2image not support chinese name fonts?

Zdenko Podobny Fri, 09 Nov 2018 00:45:02 -0800

I wanted to know the original output of chcp ;-)

I think there are (at least) two issues:


   1. A console encoding problem (Windows only; on Linux the output is correct)
   2. A font-related issue (at the moment I am not sure whether the cause is
   the font itself, pango, or text2image)

Regarding 1.:
When I run:
 text2image.exe --fonts_dir=i1252 --fontconfig_tmpdir=%temp%
--list_available_fonts
I get this output:
  0: ĺtčż?ĺ'ŚéćĄ
  1: ĺşžä¸ĺŤŽčˇŚäą| Light

After I set chcp 65001, the result is still wrong:
  0: ĺ™čż ĺ’Śé…·ćĄ·
  1: ĺşžä¸ĺŤŽčˇŚäą¦ Light

When the output is redirected to a file (text2image.exe --fonts_dir=i1252
--fontconfig_tmpdir=%temp% --list_available_fonts >font_list.txt), the font
names are correct:
  0: 孙运和酷楷
  1: 庞中华行书 Light
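
The garbled names shown on the console are consistent with the program emitting UTF-8 bytes that are then displayed through a single-byte code page. The Python sketch below reproduces that effect for the second font name. The choice of Windows-1250 (cp1250) is an assumption made for illustration; the message does not say which code page the console was actually using, and the helper names are hypothetical.

```python
# Sketch: reproduce the garbled console name from the correct one.
# Assumption: UTF-8 bytes were displayed through Windows-1250 (cp1250).

def garble(name: str, console_codepage: str = "cp1250") -> str:
    """Encode as UTF-8, then decode those bytes with a single-byte code page."""
    return name.encode("utf-8").decode(console_codepage)


def ungarble(shown: str, console_codepage: str = "cp1250") -> str:
    """Reverse the step above to recover the original name."""
    return shown.encode(console_codepage).decode("utf-8")


correct = "庞中华行书 Light"
shown = garble(correct)

# ascii() keeps the demo printable on any console, whatever its encoding.
print(ascii(shown))
print(ungarble(shown) == correct)
```

A strict decode does not work for every name: byte 0x90, for example, has no assignment in cp1250, so a name whose UTF-8 form contains it would need `errors="replace"` and could not be recovered exactly.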

When I pass the "wrong" console output as the font name, text2image is able
to find and use the font:
text2image.exe --fonts_dir=i1252  --fontconfig_tmpdir=%temp% --text
i1252/chi_sim_test.txt --outputbase=chi_sim.test.exp0 --font="ĺ™čż
ĺ’Śé…·ćĄ·"
but it crashes the same way as on Linux (issue 2), as described in
issue 1252:
ERROR: Illegal UTF8 encountered.
Index 0 char = 0xffffffa2
Index 1 char = 0xffffffd2
Index 2 char = 0xffffffd4
Index 3 char = 0xd
Index 4 char = 0xa
WARNING: Illegal UTF8 encountered

** (text2image.exe:22496): WARNING **: 09:33:51.804: Invalid UTF-8 string
passed to pango_layout_set_text()
**
ERROR:c:\users\zdeno\.cppan\storage\src\81\8f\8aa5\pango\pango-glyph-item.c:319:pango_glyph_item_iter_next_cluster:
assertion failed: (iter->start_char < iter->end_char
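
The "Index N char" values in the log look like sign-extended bytes: 0xffffffa2 has the low byte 0xa2, which cannot start a UTF-8 sequence. The Python sketch below recovers the low bytes from the reported values and locates the first offending byte. It assumes the input can be checked before it is handed to text2image; the function name is illustrative and not part of Tesseract.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Values as reported in the log; masking with 0xFF undoes the sign extension.
reported = [0xFFFFFFA2, 0xFFFFFFD2, 0xFFFFFFD4, 0xD, 0xA]
sample = bytes(value & 0xFF for value in reported)

print(sample.hex())                                               # a2d2d40d0a
print(first_invalid_utf8_offset(sample))                          # 0
print(first_invalid_utf8_offset("庞中华行书".encode("utf-8")))      # None
```

The same function can be applied to a file read in binary mode, for example the bytes of i1252/chi_sim_test.txt, to see whether the text itself is valid UTF-8.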

So one task is to fix the Windows issue so that input from and output to the
console are handled correctly (BTW, is the console UTF-8 or UTF-16?), but that
will not solve the problem that these fonts are still not usable in text2image.
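
Since the redirected font_list.txt shows the names correctly, the font names can be read from that file rather than copied from the console. A minimal Python sketch, assuming the file is UTF-8 and each entry has the "index: name" form shown in the listing above; the function name is hypothetical, and the sketch only parses the list, it does not address issue 2.

```python
import re
from typing import List

# One entry per line: optional spaces, an index, a colon, then the font name.
_ENTRY = re.compile(r"^\s*\d+:\s*(.+?)\s*$")


def parse_font_list(text: str) -> List[str]:
    """Extract font names from --list_available_fonts style output."""
    names = []
    for line in text.splitlines():
        match = _ENTRY.match(line)
        if match:
            names.append(match.group(1))
    return names


# Sample taken from the redirected output quoted in this message.
listing = "  0: 孙运和酷楷\n  1: 庞中华行书 Light\n"
fonts = parse_font_list(listing)
print(len(fonts))  # 2
```

To parse the real file, pass the result of `open("font_list.txt", encoding="utf-8").read()` to `parse_font_list`.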

 Zdenko


On Fri, 9 Nov 2018 at 7:33, bruce <[email protected]> wrote:

> Hi Zdenko,
>    I have tried the command under two cmd window encodings (chcp 65001 and
> chcp 936).
>    I got the same failure in both.
>    The results are as follows:
> [image: chcp936.png]
> [image: chcp65001.png]
>
>
> On Friday, 9 November 2018 at 5:03:00 AM UTC+8, zdenop wrote:
>>
>> What is the output of the command "chcp" (on the command line)?
>>
>> Zdenko
>>
>>
>> On Wed, 7 Nov 2018 at 2:55, bruce <[email protected]> wrote:
>>
>>> Hi zdenop, thank you for your reply.
>>> My environment is:
>>>    Windows 7 Professional 64-bit
>>>    Tesseract version:
>>> https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.0.0.20181030.exe
>>>
>>> Test training text:
>>> https://drive.google.com/open?id=1BfURsI_HdwaKeowZP0sa8L6GKWgIVDWJ
>>>
>>> Test fonts:
>>> https://drive.google.com/open?id=1YZObeYWOzNZbkMTcrCNw3KVlYT7hn1Q6
>>>
>>> https://drive.google.com/open?id=15C-v4ped8ssFGXW0pSKw6CMSQgW2s0WV
>>>
>>>
>>> I tried all the fonts with Chinese names, and all of them gave the same
>>> error message. The links above are just two of these fonts, so you can
>>> test them.
>>> I guess the --font parameter doesn't support Chinese characters?
>>>
>>> On Tuesday, 6 November 2018 at 6:11:00 PM UTC+8, zdenop wrote:
>>>>
>>>> Hello,
>>>>
>>>> Please see the bug report and suggested solution:
>>>> https://github.com/tesseract-ocr/tesseract/issues/1252
>>>>
>>>> I guess the problem is in pango, but we would like to test it. Are you able
>>>> to create a simple test case (provide a small chi_sim.txt and share the font
>>>> if possible) for this issue?
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Tue, 6 Nov 2018 at 10:56, bruce <[email protected]> wrote:
>>>>
>>>>> I used the following command to find the fonts I can use to train my
>>>>> language:
>>>>> *text2image.exe --text=chi_sim.txt --outputbase=chi_sim.庞中华行书.exp0
>>>>> --fonts_dir=C:\Windows\Fonts --find_fonts*
>>>>> and got the following result:
>>>>>   Font MStiffHeiPRC failed with 414359 hits = 100.00%
>>>>>   Font MStiffHeiPRC failed with 414359 hits = 100.00%
>>>>>   Font MStiffHeiPRC failed with 414359 hits = 100.00%
>>>>>   Font MStiffHeiPRC failed with 414359 hits = 100.00%
>>>>>   Font MStream PRC failed with 414359 hits = 100.00%
>>>>>   Font MSung PRC failed with 414359 hits = 100.00%
>>>>>   Font MSung PRC failed with 414359 hits = 100.00%
>>>>>   庞中华行书 Light : 414361 hits = 100.00%, raw = 3440 = 100.00%
>>>>>   Font 剑客毛笔行书 failed with 414357 hits = 100.00%
>>>>>   Font 可可漫雪体 failed with 414360 hits = 100.00%
>>>>>   Font 多米手写体 failed with 414253 hits = 99.97%
>>>>>   Font 字体中国-锐博体V1 failed with 414359 hits = 100.00%
>>>>>   Font 孙运和酷楷 failed with 414359 hits = 100.00%
>>>>>   Font 建刚静心楷 failed with 414359 hits = 100.00%
>>>>>   Font 张维镜手写楷书 Medium failed with 410014 hits = 98.95%
>>>>>   Font 徐金如硬笔行楷X failed with 413042 hits = 99.68%
>>>>>
>>>>> Then I used a command like this: *text2image.exe --text=chi_sim.txt
>>>>> --outputbase=chi_sim.庞中华行书.exp0 --ptsize 36 --font "庞中华行书" --fonts_dir
>>>>> C:\Windows\Fonts*
>>>>> and got the following error:
>>>>>   Could not find font named '庞中华行书'.
>>>>>   Pango suggested font 'MingLiU'.
>>>>>   Please correct --font arg.
>>>>>
>>>>> Does text2image not support fonts with Chinese names? How can I use
>>>>> these fonts?
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/a9a31397-9196-4923-aa79-43d151d534a1%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/a9a31397-9196-4923-aa79-43d151d534a1%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.