I wanted to know the original output of chcp ;-) I think there are (at least) two issues:
1. A console encoding problem (Windows only; on Linux the output is correct).
2. A font-related issue (at the moment I am not sure whether the cause is the font itself, pango, or text2image).

Regarding 1: when I run

    text2image.exe --fonts_dir=i1252 --fontconfig_tmpdir=%temp% --list_available_fonts

I get this output:

    0: ĺtčż?ĺ'ŚéćĄ
    1: ĺşžä¸ĺŤŽčˇŚäą| Light

After setting chcp 65001 the result is still wrong:

    0: ĺ™čż 和酷楷
    1: ĺşžä¸ĺŤŽčˇŚäঠLight

When the output is redirected to a file

    text2image.exe --fonts_dir=i1252 --fontconfig_tmpdir=%temp% --list_available_fonts >font_list.txt

the font names are correct:

    0: 孙运和酷楷
    1: 庞中华行书 Light

When I pass the "wrong console output" as the font name, text2image is able to find and use the font:

    text2image.exe --fonts_dir=i1252 --fontconfig_tmpdir=%temp% --text i1252/chi_sim_test.txt --outputbase=chi_sim.test.exp0 --font="ĺ™čż 和酷楷"

but it crashes the same way as on Linux (issue 2), as described in issue 1252:

    ERROR: Illegal UTF8 encountered.
    Index 0 char = 0xffffffa2
    Index 1 char = 0xffffffd2
    Index 2 char = 0xffffffd4
    Index 3 char = 0xd
    Index 4 char = 0xa
    WARNING: Illegal UTF8 encountered
    ** (text2image.exe:22496): WARNING **: 09:33:51.804: Invalid UTF-8 string passed to pango_layout_set_text()
    **
    ERROR:c:\users\zdeno\.cppan\storage\src\81\8f\8aa5\pango\pango-glyph-item.c:319:pango_glyph_item_iter_next_cluster: assertion failed: (iter->start_char < iter->end_char

So one thing is to fix the Windows issue so that input from and output to the console are handled correctly (BTW, is it UTF-8 or UTF-16?), but that will not solve the problem that these fonts are still not usable in text2image.

Zdenko

On Fri, 9 Nov 2018 at 7:33, bruce <[email protected]> wrote:

> Hi Zdenko,
> I have tried the command under two cmd window encodings (chcp 65001 and chcp 936).
> I got the same failure in both cases.
> The results are as follows:
> [image: chcp936.png]
> [image: chcp65001.png]
>
> On Friday, 9 November 2018 at 5:03:00 AM UTC+8, zdenop wrote:
>>
>> What is the output of the command "chcp" (on the command line)?
>>
>> Zdenko
>>
>> On Wed, 7 Nov 2018 at 2:55, bruce <[email protected]> wrote:
>>
>>> Hi zdenop, thank you for your reply.
>>> My environment is:
>>> Windows 7 Professional 64-bit
>>> Tesseract version:
>>> https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.0.0.20181030.exe
>>>
>>> Test training text:
>>> https://drive.google.com/open?id=1BfURsI_HdwaKeowZP0sa8L6GKWgIVDWJ
>>>
>>> Test fonts:
>>> https://drive.google.com/open?id=1YZObeYWOzNZbkMTcrCNw3KVlYT7hn1Q6
>>> https://drive.google.com/open?id=15C-v4ped8ssFGXW0pSKw6CMSQgW2s0WV
>>>
>>> I tried all the fonts with Chinese names. All of them gave the same error message. The links above are just two of these fonts; you can test with them.
>>> I guess the --font parameter doesn't support Chinese characters?
>>>
>>> On Tuesday, 6 November 2018 at 6:11:00 PM UTC+8, zdenop wrote:
>>>>
>>>> Hello,
>>>>
>>>> Please see the bug report and suggested solution:
>>>> https://github.com/tesseract-ocr/tesseract/issues/1252
>>>>
>>>> I guess the problem is in pango, but we would like to test it. Are you able to create a simple test case (provide a small chi_sim.txt and share the font, if possible) for this issue?
>>>>
>>>> Zdenko
>>>>
>>>> On Tue, 6 Nov 2018 at 10:56, bruce <[email protected]> wrote:
>>>>
>>>>> I use the following command to find the fonts I can use to train my language.
>>>>> *text2image.exe --text=chi_sim.txt --outputbase=chi_sim.庞中华行书.exp0 --fonts_dir=C:\Windows\Fonts --find_fonts*
>>>>> and I got the following result:
>>>>> Font MStiffHeiPRC failed with 414359 hits = 100.00%
>>>>> Font MStiffHeiPRC failed with 414359 hits = 100.00%
>>>>> Font MStiffHeiPRC failed with 414359 hits = 100.00%
>>>>> Font MStiffHeiPRC failed with 414359 hits = 100.00%
>>>>> Font MStream PRC failed with 414359 hits = 100.00%
>>>>> Font MSung PRC failed with 414359 hits = 100.00%
>>>>> Font MSung PRC failed with 414359 hits = 100.00%
>>>>> 庞中华行书 Light : 414361 hits = 100.00%, raw = 3440 = 100.00%
>>>>> Font 剑客毛笔行书 failed with 414357 hits = 100.00%
>>>>> Font 可可漫雪体 failed with 414360 hits = 100.00%
>>>>> Font 多米手写体 failed with 414253 hits = 99.97%
>>>>> Font 字体中国-锐博体V1 failed with 414359 hits = 100.00%
>>>>> Font 孙运和酷楷 failed with 414359 hits = 100.00%
>>>>> Font 建刚静心楷 failed with 414359 hits = 100.00%
>>>>> Font 张维镜手写楷书 Medium failed with 410014 hits = 98.95%
>>>>> Font 徐金如硬笔行楷X failed with 413042 hits = 99.68%
>>>>>
>>>>> Then I use a command like this:
>>>>> *text2image.exe --text=chi_sim.txt --outputbase=chi_sim.庞中华行书.exp0 --ptsize 36 --font "庞中华行书" --fonts_dir C:\Windows\Fonts*
>>>>> and I got the following error:
>>>>> Could not find font named '庞中华行书'.
>>>>> Pango suggested font 'MingLiU'.
>>>>> Please correct --font arg.
>>>>>
>>>>> Does text2image not support fonts with Chinese names? How can I use these fonts?
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a9a31397-9196-4923-aa79-43d151d534a1%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/a9a31397-9196-4923-aa79-43d151d534a1%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>>> For more options, visit https://groups.google.com/d/optout.
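
For anyone who wants to reproduce the two symptoms from this thread outside text2image, here is a minimal Python sketch. It is not part of Tesseract. The helper names (first_invalid_utf8_offset, misdecode) are made up for illustration, and code page 1250 is an assumption: the mangled name "ĺşž..." in the console output happens to match the UTF-8 bytes of "庞" read in that code page. The first helper reports where a byte string stops being valid UTF-8, which is a quick way to check a training text such as chi_sim_test.txt before passing it to text2image. The second shows how a correctly encoded font name ends up looking wrong in a console that uses a legacy code page.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def misdecode(text: str, codepage: str = "cp1250") -> str:
    """Show UTF-8 bytes as a console using a legacy code page would display them.

    cp1250 is an assumption, not something the thread states explicitly.
    """
    return text.encode("utf-8").decode(codepage, errors="replace")


if __name__ == "__main__":
    # Bytes a2 d2 d4 0d 0a are taken from the error log in this thread;
    # the leading "ok\r\n" is a hypothetical valid line added for the demo.
    sample = b"ok\r\n\xa2\xd2\xd4\r\n"
    print(first_invalid_utf8_offset(sample))

    # ascii() keeps the demo printable even on a console that cannot show the result.
    print(ascii(misdecode("\u5e9e")))  # U+5E9E is the first character of the font name
```

To check a real file, read it in binary mode (open(path, "rb").read()) and pass the bytes to first_invalid_utf8_offset. A non-None result would suggest the text was saved in another encoding (for example a GBK-family one) and needs converting to UTF-8 first. That is only one possible cause of the "Illegal UTF8 encountered" error and does not rule out a pango problem.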

