Re: [tesseract-ocr] Incremental Training Tesseract 4.0+ for fraktur

Val LNB Wed, 29 Jan 2020 06:03:12 -0800

Thank you for the link!

I found the following example: 
https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#finetuning-based-on-scriptfraktur


Here are instructions that I have figured out so far for fine-tuning an 
existing model:

On Ubuntu 18.04, I first double-checked that the right packages are installed:

dpkg -s tesseract-ocr
dpkg -s tesseract-ocr-frk (not actually used: I grabbed the latest model from 
https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_best/
and placed it in ~/train/tessdata/script under the name Fraktur.traineddata)
dpkg -s libtesseract-dev (unsure whether this package is necessary, but I 
installed it a while ago)

~$ tesseract --version
tesseract 4.0.0-beta.1

git clone https://github.com/tesseract-ocr/tesstrain.git

cd to tesstrain directory 

Then start the training process with the following command:

make -r training START_MODEL=Fraktur TESSDATA=~/train/tessdata/script \
  GROUND_TRUTH_DIR=~/train/data_train_2020_1_28_16_49_54 \
  MODEL_NAME=Frak_LV_J29

So ~/train/tessdata/script/Fraktur.traineddata is used as the starting model,
while GROUND_TRUTH_DIR holds about 6,000 pairs of .gt.txt and .tif files.
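
Before starting a long run, it may be worth confirming that every image really 
has a transcription next to it. A minimal sketch, assuming each line image 
NAME.tif sits in the same directory as its NAME.gt.txt; the function name 
list_unpaired is made up for illustration and is not part of tesstrain:

```shell
#!/usr/bin/env bash
# Print every .tif in a directory that has no matching .gt.txt.
# list_unpaired is a hypothetical helper, not part of tesstrain.
list_unpaired() {
  local dir=$1 img
  for img in "$dir"/*.tif; do
    [ -e "$img" ] || continue          # the glob matched nothing
    [ -f "${img%.tif}.gt.txt" ] || printf '%s\n' "$img"
  done
}

# Example: list_unpaired ~/train/data_train_2020_1_28_16_49_54
```

If the function prints nothing, every .tif it found has a .gt.txt partner.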

Defaults, assuming the wiki is correct: training runs for 10,000 iterations 
(MAX_ITERATIONS, which counts iterations rather than epochs), and 10% of 
GROUND_TRUTH_DIR is held out for testing.
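
To put those defaults in numbers, here is a rough calculation. It assumes the 
tesstrain defaults are RATIO_TRAIN=0.90 and MAX_ITERATIONS=10000 (worth 
checking against the Makefile in your checkout), that one iteration processes 
one line, and that the set holds exactly 6000 lines:

```shell
#!/usr/bin/env bash
# Rough split and pass count; all three inputs are assumptions.
total_lines=6000        # "6k pairs" from the message above
train_percent=90        # assumed RATIO_TRAIN=0.90
max_iterations=10000    # assumed MAX_ITERATIONS default

train_lines=$(( total_lines * train_percent / 100 ))
eval_lines=$(( total_lines - train_lines ))
# Passes over the training lines, in hundredths to stay in integer math.
passes_x100=$(( max_iterations * 100 / train_lines ))

printf 'train=%d eval=%d passes=%d.%02d\n' \
  "$train_lines" "$eval_lines" \
  $(( passes_x100 / 100 )) $(( passes_x100 % 100 ))
```

Under these assumptions the run would see each training line a little under 
twice, so the 10,000 figure is much closer to two passes over the data than 
to 10,000 epochs.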

My only worry is that my .tif files apparently have no DPI information, so 
the default of 70 is used.

Are the warnings about the missing DPI a bad sign?


Interestingly, .png files are used when running training, so I could perhaps 
have skipped the conversion to .tif, since I started with .png! :)

Now, the big question: how long will it take to run 10,000 iterations on an 
average 4-core Xeon v3 VM?



 

On Tuesday, January 28, 2020 at 7:24:11 PM UTC+2, shree wrote:
>
> Please see https://github.com/tesseract-ocr/tesstrain/wiki
>
> There are already newly trained models by @stweil for Fraktur.
>
> On Tue, Jan 28, 2020, 22:46 Val LNB <[email protected]> wrote:
>
>> *How to perform incremental training on Tesseract 4.0+?*
>>
>>
>> I want to improve the existing Fraktur (frk) model with some 6000 
>> hand-curated lines from our library. 
>>
>> The ground truth for these lines has 10 new Unicode characters not present 
>> in the German Fraktur model.
>>
>>
>> How can I continue training from the existing German fraktur model 
>> without full retraining?
>>
>>
>> Progress so far:
>>
>>
>>    - Following information on https://github.com/tesseract-ocr/tesstrain
>>    - My script created the .tif and gt.txt files based on examples 
>>    provided in 
>>    https://github.com/tesseract-ocr/tesstrain/blob/master/ocrd-testset.zip
>>    - Now the Makefile 
>>    https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile has a 
>>    START_MODEL variable 
>>
>>
>> What, if anything, do I enter into START_MODEL?
>>
>>
>> It would be fantastic to see an example CLI command used for your 
>> incremental training. :)
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/1e79c1d6-de0c-4c87-b07c-9455b90cfef4%40googlegroups.com
>> .
>>
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/6c612c7c-99f5-43eb-b338-928884af3e0d%40googlegroups.com.
