I think fine-tuning may work very well in this case; there is no need to
train from scratch. Training from scratch does not guarantee better results,
especially if you don't do it correctly.

I suggest trying fine-tuning first to see if the results are good enough
for you. That way you also get comfortable with the training process.

Training from scratch is the same process, only more difficult: you see the
results only after many hours or days, and if you got something wrong you
have to start over. You also need to adjust the learning rate during
training and monitor the training curves. I don't think there is a simple
recipe.

If you want to preserve as much as possible of what the model has learned
so far, you can try two things:

1. Fine-tune with the new fonts and the old fonts (or similar ones).

2. Try this:
https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00#training-just-a-few-layers

I recommend trying option 1 first and making it work correctly, then trying
option 2 to see if it improves things.

Just make sure to split your data into training data and test data at the
very beginning, and monitor the test accuracy to limit overfitting. You
need a reliable way to compare results.
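
To make this concrete, here is a minimal sketch of a fixed train/test split
plus a simple character error rate for comparing runs. It is only an
illustration under my own assumptions: the function names, the 10% test
fraction, the seed, and the "line_0000.lstmf" file names are hypothetical and
are not part of the Tesseract tooling.

```python
import random


def split_train_eval(samples, eval_fraction=0.1, seed=42):
    """Return (train, test) lists; the same seed always gives the same split."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = sorted(samples)  # sort first so the input order does not matter
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


def char_error_rate(reference, hypothesis):
    """Levenshtein distance between the two strings divided by len(reference)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(reference)


if __name__ == "__main__":
    # Hypothetical sample names; replace with your own training files.
    files = [f"line_{i:04d}.lstmf" for i in range(100)]
    train, test = split_train_eval(files)
    print(len(train), "training samples,", len(test), "test samples")
    print("CER:", char_error_rate("tesseract", "tesseact"))
```

Do the split once, save both lists, and reuse the same test list for every
experiment (fine-tuning, training a few layers, or training from scratch), so
the error rates are measured on the same data and can be compared.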


Bye

Lorenzo


On Wed, 25 Mar 2020 at 09:54, Essam Zaky <[email protected]>
wrote:

> @Lorenzo
> I need to do that because the accuracy of the current Arabic model is not
> as good as English, and I have a lot of fonts to add to the Arabic model.
> Adding them by fine-tuning will affect the model, so I need to build from
> scratch and make the model more generalized.
> So I need to know what was done for the English model and take it as a
> reference to make a new Arabic model.
>
>
> On Tuesday, 24 March 2020 at 10:05:03 PM UTC+2, Essam Zaky wrote:
>>
>> Hi all,
>>
>> I would like to build *.traineddata from scratch, specifically for English
>> and Arabic.
>>
>> So let's take English as an example.
>> My question is: how do I prepare the fonts folder?
>>
>> I read the
>> https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh
>> file
>> and found that it contains only about 32 font names.
>> Should I add the other Latin fonts installed on the training machine to
>> that file ("language-specific.sh")?
>>
>>
>> I used the "font manager" tool and found about 147 fonts installed on the
>> training machine.
>> I opened
>> https://github.com/tesseract-ocr/langdata_lstm/blob/master/eng/okfonts.txt
>> and it contains 4567 font names.
>> Should I search for, download, and install all the missing fonts on the
>> training machine?
>>
>> Should I collect all the font files from the training machine, create a new
>> fonts folder "HOME/.fonts", and paste all the fonts into that folder?
>>
>> I see that fonts have different extensions ("*.ttf, *.otf, *.afm, ...").
>> Do all font types work for training, or do I need a specific type?
>>
>>
>> I will post another question about the required text data.
>>
>> Thanks for your help
>>
>>
>>
>> Regards
>> Essam
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/f74b7970-db67-4cb5-aec4-7a17192dc0ef%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/f74b7970-db67-4cb5-aec4-7a17192dc0ef%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

