Hi zdenop,

The original output of chcp on my machine is "936". As you said, I think this is a console encoding problem, but I don't know how to solve it. In the end I worked around it another way: I used a program called "FontCreator" to change the fonts' names to English.
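For anyone who lands here later, below is a small Python sketch of why the console can show garbled font names while a redirected file shows them correctly. It assumes that text2image writes the names as UTF-8 bytes, which the thread suggests but does not confirm. The helper name `as_seen_in_console` is made up for this example, and cp852 is only a guess at a Central European console code page; cp936 is the one reported by chcp above.

```python
# Sketch only: assumes the font name reaches the console as UTF-8 bytes.
# as_seen_in_console is a hypothetical helper, not part of tesseract.

def as_seen_in_console(name: str, codepage: str) -> str:
    """Decode the UTF-8 bytes of `name` the way a console using `codepage` would."""
    raw = name.encode("utf-8")
    # errors="replace" substitutes U+FFFD for bytes the code page cannot map
    return raw.decode(codepage, errors="replace")


if __name__ == "__main__":
    font_name = "孙运和酷楷"  # one of the font names from this thread
    # cp936 is the code page reported by chcp; cp852 is an assumed example
    for codepage in ("utf-8", "cp936", "cp852"):
        shown = as_seen_in_console(font_name, codepage)
        status = "intact" if shown == font_name else "garbled"
        # !a prints ASCII-only escapes, so this line cannot itself fail on a
        # console whose code page lacks the characters
        print(f"{codepage:>6}: {status} {shown!a}")
```

If the assumption holds, only the UTF-8 decoding returns the original name, which would match the observation that the redirected file is correct while the console is not.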

On Friday, November 9, 2018 at 4:44:41 PM UTC+8, zdenop wrote:
>
> I wanted to know the original output of chcp ;-)
>
> I think there are (at least) 2 issues:
>
> 1. A console encoding problem (Windows only; on Linux it is correct)
> 2. A font-related issue (at the moment I am not sure whether it is the font itself, pango, or text2image)
>
> Regarding 1.:
> When I run:
> text2image.exe --fonts_dir=i1252 --fontconfig_tmpdir=%temp% --list_available_fonts
> I get this output:
> 0: ĺtčż?ĺ'ŚéćĄ
> 1: ĺşžä¸ĺŤŽčˇŚäą| Light
>
> When I set chcp 65001 the result is still wrong:
> 0: ĺ™čż 和酷楷
> 1: ĺşžä¸ĺŤŽčˇŚäঠLight
>
> When the output is redirected to a file (text2image.exe --fonts_dir=i1252 --fontconfig_tmpdir=%temp% --list_available_fonts >font_list.txt) the font names are correct:
> 0: 孙运和酷楷
> 1: 庞中华行书 Light
>
> When I use the "wrong console output", text2image is able to find and use the font:
> text2image.exe --fonts_dir=i1252 --fontconfig_tmpdir=%temp% --text i1252/chi_sim_test.txt --outputbase=chi_sim.test.exp0 --font="ĺ™čż 和酷楷"
> but it crashes the same way as on Linux (issue 2), as described in issue 1252:
> ERROR: Illegal UTF8 encountered.
> Index 0 char = 0xffffffa2
> Index 1 char = 0xffffffd2
> Index 2 char = 0xffffffd4
> Index 3 char = 0xd
> Index 4 char = 0xa
> WARNING: Illegal UTF8 encountered
>
> ** (text2image.exe:22496): WARNING **: 09:33:51.804: Invalid UTF-8 string passed to pango_layout_set_text()
> **
> ERROR:c:\users\zdeno\.cppan\storage\src\81\8f\8aa5\pango\pango-glyph-item.c:319:pango_glyph_item_iter_next_cluster: assertion failed: (iter->start_char < iter->end_char)
>
> So one task is to fix the Windows issue so that input/output from/to the console is handled correctly (BTW, is it UTF-8 or UTF-16?), but that will not solve the issue that these fonts are still not usable in text2image.
>
> Zdenko
>
> On Fri, Nov 9, 2018 at 7:33, bruce <[email protected]> wrote:
>
>> Hi Zdenko,
>> I have tried the command under two cmd window encodings (chcp 65001 and chcp 936).
>> I got the same failure results.
>> results as follows:
>> [image: chcp936.png]
>> [image: chcp65001.png]
>>
>> On Friday, November 9, 2018 at 5:03:00 AM UTC+8, zdenop wrote:
>>>
>>> What is the output of the command "chcp" (on the command line)?
>>>
>>> Zdenko
>>>
>>> On Wed, Nov 7, 2018 at 2:55, bruce <[email protected]> wrote:
>>>
>>>> Hi zdenop, thank you for your reply.
>>>> My environment is:
>>>> Windows 7 Professional 64-bit
>>>> tesseract version:
>>>> https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.0.0.20181030.exe
>>>>
>>>> Test training text:
>>>> https://drive.google.com/open?id=1BfURsI_HdwaKeowZP0sa8L6GKWgIVDWJ
>>>>
>>>> Test fonts:
>>>> https://drive.google.com/open?id=1YZObeYWOzNZbkMTcrCNw3KVlYT7hn1Q6
>>>> https://drive.google.com/open?id=15C-v4ped8ssFGXW0pSKw6CMSQgW2s0WV
>>>>
>>>> I tried all of the fonts with Chinese names. All of them gave the same error message. The links above are just two of these fonts, which you can test.
>>>> I guess the --font parameter doesn't support Chinese characters?
>>>>
>>>> On Tuesday, November 6, 2018 at 6:11:00 PM UTC+8, zdenop wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> Please see the bug report and suggested solution:
>>>>> https://github.com/tesseract-ocr/tesseract/issues/1252
>>>>>
>>>>> I guess the problem is in pango, but we would like to test it. Are you able to create a simple test case (provide a small chi_sim.txt and share the font if possible) for this issue?
>>>>>
>>>>> Zdenko
>>>>>
>>>>> On Tue, Nov 6, 2018 at 10:56, bruce <[email protected]> wrote:
>>>>>
>>>>>> I use the following command to find the fonts I can use to train my language.
>>>>>> text2image.exe --text=chi_sim.txt --outputbase=chi_sim.庞中华行书.exp0 --fonts_dir=C:\Windows\Fonts --find_fonts
>>>>>> and I got the following result:
>>>>>> Font MStiffHeiPRC failed with 414359 hits = 100.00%
>>>>>> Font MStiffHeiPRC failed with 414359 hits = 100.00%
>>>>>> Font MStiffHeiPRC failed with 414359 hits = 100.00%
>>>>>> Font MStiffHeiPRC failed with 414359 hits = 100.00%
>>>>>> Font MStream PRC failed with 414359 hits = 100.00%
>>>>>> Font MSung PRC failed with 414359 hits = 100.00%
>>>>>> Font MSung PRC failed with 414359 hits = 100.00%
>>>>>> 庞中华行书 Light : 414361 hits = 100.00%, raw = 3440 = 100.00%
>>>>>> Font 剑客毛笔行书 failed with 414357 hits = 100.00%
>>>>>> Font 可可漫雪体 failed with 414360 hits = 100.00%
>>>>>> Font 多米手写体 failed with 414253 hits = 99.97%
>>>>>> Font 字体中国-锐博体V1 failed with 414359 hits = 100.00%
>>>>>> Font 孙运和酷楷 failed with 414359 hits = 100.00%
>>>>>> Font 建刚静心楷 failed with 414359 hits = 100.00%
>>>>>> Font 张维镜手写楷书 Medium failed with 410014 hits = 98.95%
>>>>>> Font 徐金如硬笔行楷X failed with 413042 hits = 99.68%
>>>>>>
>>>>>> Then I used this command:
>>>>>> text2image.exe --text=chi_sim.txt --outputbase=chi_sim.庞中华行书.exp0 --ptsize 36 --font "庞中华行书" --fonts_dir C:\Windows\Fonts
>>>>>> and I got the following error:
>>>>>> Could not find font named '庞中华行书'.
>>>>>> Pango suggested font 'MingLiU'.
>>>>>> Please correct --font arg.
>>>>>>
>>>>>> Does text2image not support fonts with Chinese names? How can I use these fonts?
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a9a31397-9196-4923-aa79-43d151d534a1%40googlegroups.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.

