AFAIK all the language models are trained from scratch. 

In my experience the error rate is significantly higher on names, e.g. 
scientific names in botany, which are mostly some sort of Latinised Greek. 
The same holds for personal names that do not fit the main language of the 
model (or any language at all).

Thus I guess that recognition can be improved by additional training 
with wordlists or texts from a special domain. (Domain is the usual linguistic 
term for a class of text such as poems, drama, news, science, tech, etc.)
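
To illustrate the idea, here is a minimal sketch in Python of how a 
domain-specific training text could be put together. It is only an 
assumption about one workable approach, not the way the official models 
were built: the function name build_training_lines, the two word lists and 
the domain_ratio parameter are all made up for this example. It mixes 
domain words (here a few botanical names) with ordinary words so that the 
model still sees normal text around the special vocabulary.

```python
import random


def build_training_lines(domain_words, general_words, n_lines=5,
                         words_per_line=8, domain_ratio=0.5, seed=0):
    """Return text lines that mix domain words with general words.

    domain_ratio is the approximate share of words drawn from the
    domain list; the fixed seed makes the output reproducible.
    """
    rng = random.Random(seed)
    lines = []
    for _ in range(n_lines):
        words = []
        for _ in range(words_per_line):
            # Pick the pool first, then a word from that pool.
            pool = domain_words if rng.random() < domain_ratio else general_words
            words.append(rng.choice(pool))
        lines.append(" ".join(words))
    return lines


# Hypothetical word lists, for illustration only.
domain = ["Quercus", "robur", "Bellis", "perennis", "Taraxacum", "officinale"]
general = ["the", "page", "details", "Service", "Date:", "23"]

lines = build_training_lines(domain, general)
for line in lines:
    print(line)
```

The resulting lines would then be used as the training text for 
fine-tuning an existing model, following whatever training workflow you 
already have. Whether this actually lowers the error rate on such names 
has to be checked on real images.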

akm wrote on Tuesday, 27 July 2021 at 18:23:19 UTC+2:

> I would like to add one more question: were the other Latin languages, 
> such as French, trained from scratch or just fine-tuned from the English model?
>
> On Saturday, July 24, 2021 at 11:25:12 PM UTC-4 akm wrote:
>
>> Hi,
>>
>> I am trying to follow the TessTutorial to train Tesseract from scratch. I 
>> have some questions about the langdata, to understand how the training 
>> works.
>>
>> The provided training text has some random English words. The questions 
>> regarding the training text:
>>
>> 1 - Will using text from a specific domain improve the performance of 
>> Tesseract on that domain? For example, training Tesseract on special names 
>> or vocabulary that is not English but uses Latin letters and numbers (a-z A-Z 
>> 0-9 and special chars). Example: pH_scale1
>>
>> 2 - Will generating words from random letters work as well as using 
>> English words?
>> The provided eng.trainingtext has text such as :
>> "different New Articles page 23 a To Service ~~ a details DC that don't 
>> as 7 «« Date:"
>>
>> What if I use something random like this:
>> "sqwrLwU2bo BLiRDhvAoM USyWtpBFi5 UwLgXyoz1e UqiXudhrhz dDKAdnI8Z2 
>> YIl6T6d7m6 G2IVtTRbuu Lh6NvWNLc3 CGD2SXOoNT"
>>  
>>
>> Thanks
>>
>
