Re: [tesseract-ocr] Re: How to prepare fonts folder to train from scratch

Essam Zaky Wed, 25 Mar 2020 03:34:36 -0700

My goal is to recognize Arabic (with digits and punctuation) together with English.
Some English lines contain Arabic words,
and some Arabic lines contain English words.


I did some page layout analysis, split the text into lines, and tried to
detect the language of each word from the word's geometry within the line.
If a line contains both Arabic and English, I pass it to both the English
engine and the Arabic engine, then select the final result based on the
confidence each engine returns.
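
A minimal sketch of that selection step, assuming each engine returns a list of words with a confidence between 0 and 100 for the same line. The names WordResult, mean_confidence and pick_result are hypothetical and not part of Tesseract, and obtaining the per-word confidences from each engine is left out here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordResult:
    text: str
    confidence: float  # assumed scale: 0 (worst) to 100 (best)


def mean_confidence(words: List[WordResult]) -> float:
    """Average confidence over the non-empty words of one line; 0.0 if none."""
    scored = [w.confidence for w in words if w.text.strip()]
    return sum(scored) / len(scored) if scored else 0.0


def pick_result(
    eng_words: List[WordResult], ara_words: List[WordResult]
) -> Tuple[str, List[WordResult]]:
    """Return the engine label and words of the more confident engine."""
    eng_score = mean_confidence(eng_words)
    ara_score = mean_confidence(ara_words)
    # Ties go to Arabic; that is an arbitrary choice for this sketch.
    if eng_score > ara_score:
        return "eng", eng_words
    return "ara", ara_words
```

This compares the two engines per line. Comparing them word by word instead would suit mixed lines better, but it needs the word boxes of both engines to be aligned first.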
On Wednesday, 25 March 2020 at 12:15:42 PM UTC+2, shree wrote:
>
> The issue with Arabic is related to RTL processing and how punctuation and 
> digits are handled. If your training text does not have them, you will have 
> greater success. 
>
> On Wed, Mar 25, 2020, 15:32 Essam Zaky <[email protected]> 
> wrote:
>
>> Thanks @Loranzo and @Shree.
>> I will try fine-tuning, and if the results are still not satisfactory I
>> will switch back to training from scratch.
>>
>> On Tuesday, 24 March 2020 at 10:05:03 PM UTC+2, Essam Zaky wrote:
>>>
>>> Hi all,
>>>
>>> I would like to build *.traineddata files from scratch, specifically for
>>> English and Arabic.
>>>
>>> Let's take English as an example.
>>> My question is: how do I prepare the fonts folder?
>>>
>>> I read the file
>>> https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh
>>> and found that it contains only about 32 font names.
>>> Should I add the other Latin fonts installed on the training machine to
>>> that file ("language-specific.sh")?
>>>
>>>
>>> I used the "font manager" tool and found about 147 fonts installed on the
>>> training machine.
>>> I opened
>>> https://github.com/tesseract-ocr/langdata_lstm/blob/master/eng/okfonts.txt 
>>> and it contains 4567 font names.
>>> Should I search for, download, and install all the missing fonts on the
>>> training machine?
>>>
>>> Should I collect all the font files from the training machine, create a new
>>> fonts folder "$HOME/.fonts", and copy all the fonts into it?
>>>
>>> I see that fonts have different extensions (*.ttf, *.otf, *.afm, ...).
>>> Do all font types work for training, or do I need a specific type?
>>>
>>>
>>> I will post another question about the required text data.
>>>
>>> Thanks for your help
>>>
>>>
>>>
>>> Regards
>>> Essam
>>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/3b8c2283-18a2-49a1-bc36-3fb70d1e3c76%40googlegroups.com.
