The problem with Arabic comes from right-to-left (RTL) processing and the way punctuation and digits are handled. If your training text contains no punctuation or digits, you will have greater success.
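
If it helps, here is a minimal sketch of one way to strip digits and punctuation from training text before using it. It is an illustration only and not part of the Tesseract tooling. The function names are hypothetical, and it assumes that Unicode general categories (Nd for decimal digits, P* for punctuation) are a good enough definition of "digits and punctuation" for your data. Symbols such as "+" or "$" fall in other categories and are left alone.

```python
import unicodedata


def strip_digits_and_punctuation(line: str) -> str:
    """Return line without decimal digits (Nd) and punctuation (P*) characters."""
    kept = []
    for ch in line:
        category = unicodedata.category(ch)
        # Nd covers ASCII and Arabic-Indic digits; P* covers ASCII and
        # Arabic punctuation such as the Arabic comma (U+060C).
        if category == "Nd" or category.startswith("P"):
            continue
        kept.append(ch)
    # Collapse the whitespace runs left behind by the removed characters.
    return " ".join("".join(kept).split())


def clean_training_text(lines):
    """Yield cleaned lines, skipping any that end up empty."""
    for line in lines:
        cleaned = strip_digits_and_punctuation(line)
        if cleaned:
            yield cleaned
```

You would read your training text file line by line, pass the lines through clean_training_text, and write the result to a new file. Whether removing these characters is acceptable depends on what you need the final model to recognize.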

On Wed, Mar 25, 2020, 15:32 Essam Zaky <[email protected]> wrote:

> Thanks @Loranzo and @Shree.
> I will try fine-tuning, and if the result is still not satisfactory I will
> switch back to building from scratch.
>
> On Tuesday, March 24, 2020 at 10:05:03 PM UTC+2, Essam Zaky wrote:
>>
>> Hi all,
>>
>> I would like to build *.traineddata from scratch, specifically for English
>> and Arabic.
>>
>> Let's take English as an example.
>> My question is: how do I prepare the fonts folder?
>>
>> I read the file
>> https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh
>> and found that it contains only about 32 font names.
>> Should I add the other Latin fonts installed on the training machine to
>> "language-specific.sh"?
>>
>> I used the "font manager" tool and found about 147 fonts installed on the
>> training machine.
>> I opened
>> https://github.com/tesseract-ocr/langdata_lstm/blob/master/eng/okfonts.txt
>> and it contains 4567 font names.
>> Should I search for, download, and install all the missing fonts on the
>> training machine?
>>
>> Should I collect all the font files from the training machine, create a
>> new fonts folder "HOME/.fonts", and paste all the fonts into that folder?
>>
>> I see that fonts have different extensions ("*.ttf, *.otf, *.afm, ...").
>> Do all font types work in training, or do I need a specific type?
>>
>> I will write another question about the required text data.
>>
>> Thanks for the help.
>>
>> Regards,
>> Essam
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/4928b6a0-c06c-49ca-8ecd-e300dc8da736%40googlegroups.com

