My goal is to recognize Arabic text (including digits and punctuation) together with English. Some English lines contain Arabic words, and some Arabic lines contain English words.
I did some page layout analysis, split the text into lines, and try to detect the language of each word from the word's geometry within the line. If a line contains both Arabic and English, I pass it to both the English engine and the Arabic engine, then select the final result based on the confidence each engine returns.

On Wednesday, 25 March 2020 at 12:15:42 PM UTC+2, shree wrote:
>
> The issue with Arabic is related to RTL processing and how punctuation and
> digits are handled. If your training text does not have them, you will have
> greater success.
>
> On Wed, Mar 25, 2020, 15:32 Essam Zaky <[email protected]> wrote:
>
>> Thanks @Loranzo and @Shree.
>> I will try fine-tuning, and if the result is still not satisfactory I
>> will switch back to building from scratch.
>>
>> On Tuesday, 24 March 2020 at 10:05:03 PM UTC+2, Essam Zaky wrote:
>>>
>>> Hi all,
>>>
>>> I would like to build *.traineddata from scratch, specifically for
>>> English and Arabic.
>>>
>>> Let's take English as an example.
>>> My question: how do I prepare the fonts folder?
>>>
>>> I read the
>>> https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh
>>> file and found that it contains only about 32 font names.
>>> Should I add the other Latin fonts installed on the training machine to
>>> "language-specific.sh"?
>>>
>>> I used the "Font Manager" tool and found about 147 fonts installed on
>>> the training machine.
>>> I opened
>>> https://github.com/tesseract-ocr/langdata_lstm/blob/master/eng/okfonts.txt
>>> and it contains 4567 font names.
>>> Should I search for, download, and install all the missing fonts on the
>>> training machine?
>>>
>>> Should I collect all font files from the training machine, create a new
>>> fonts folder "HOME/.fonts", and paste all the fonts into that folder?
>>>
>>> I see that fonts have different extensions ("*.ttf, *.otf, *.afm, ...").
>>> Do all font types work for training, or do I need a specific type?
>>>
>>> I will write another question about the required text data.
>>>
>>> Thanks for the help.
>>>
>>> Regards,
>>> Essam
>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/4928b6a0-c06c-49ca-8ecd-e300dc8da736%40googlegroups.com
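
As an illustration of the confidence-based selection described at the top of this message, here is a minimal sketch in Python. It is not the code used in the actual pipeline. It assumes both engines are run on the same word boxes (so their outputs align one to one) and that their confidence values are on a comparable scale. The `Candidate` type, the `pick_by_confidence` function, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str          # recognized text for one word box
    confidence: float  # engine confidence; assumed comparable across engines


def pick_by_confidence(arabic: List[Candidate], english: List[Candidate]) -> List[str]:
    """For each word box, keep the text from whichever engine was more confident."""
    if len(arabic) != len(english):
        raise ValueError("both engines must be run on the same word boxes")
    chosen = []
    for ara, eng in zip(arabic, english):
        # Ties go to the Arabic result; that is an arbitrary choice in this sketch.
        chosen.append(ara.text if ara.confidence >= eng.confidence else eng.text)
    return chosen


# Hypothetical engine outputs for one three-word mixed line (made-up values).
arabic_out = [Candidate("بتاريخ", 91.0), Candidate("٢", 40.0), Candidate("ا", 12.5)]
english_out = [Candidate("gy", 20.0), Candidate("25", 88.0), Candidate("UTC", 95.0)]

print(pick_by_confidence(arabic_out, english_out))
```

Confidence scores from two separately trained models are not necessarily calibrated against each other, so a raw comparison like this may favor one engine; a per-engine offset or threshold might be needed in practice.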

