Thank you for the link! I found the following example: https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#finetuning-based-on-scriptfraktur
Here are the instructions I have figured out so far for fine-tuning an existing model.

On Ubuntu 18.04, I first double-checked that the right packages are installed:

    dpkg -s tesseract-ocr
    dpkg -s tesseract-ocr-frk
    dpkg -s libtesseract-dev

tesseract-ocr-frk ended up not being used: I grabbed the latest model from https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_best/ and placed it in ~/train/tessdata/script under the name Fraktur.traineddata. I am unsure whether libtesseract-dev is necessary, but I installed it a while ago.

    ~$ tesseract --version
    tesseract 4.0.0-beta.1

Next, clone tesstrain and change into its directory:

    git clone https://github.com/tesseract-ocr/tesstrain.git
    cd tesstrain

Then start the training process with the following command:

    make -r training START_MODEL=Fraktur TESSDATA=~/train/tessdata/script GROUND_TRUTH_DIR=~/train/data_train_2020_1_28_16_49_54 MODEL_NAME=Frak_LV_J29

So ~/train/tessdata/script/Fraktur.traineddata will be used as the starting model, while GROUND_TRUTH_DIR holds 6k pairs of .gt.txt and .tif files.

Defaults, assuming the wiki is correct: the run stops after 10,000 iterations (MAX_ITERATIONS; these are training iterations, not full epochs over the data), and 10% of GROUND_TRUTH_DIR is held out for evaluation (RATIO_TRAIN=0.90).

My only worry is that my .tif files apparently have no DPI information, so the default of 70 is used. Are the warnings about the missing DPI a bad sign?

Interestingly, .png files are used when running training, so I could perhaps have skipped the conversion to .tif, since I started with .png! :)

Now, the big question: how long will it take to run 10,000 iterations on an average 4-core Xeon v3 VM?

On Tuesday, January 28, 2020 at 7:24:11 PM UTC+2, shree wrote:
>
> Please see https://github.com/tesseract-ocr/tesstrain/wiki
>
> There are already newly trained models by @stweil for Fraktur.
>
> On Tue, Jan 28, 2020, 22:46 Val LNB <[email protected]> wrote:
>
>> *How to perform incremental training on Tesseract 4.0+?*
>>
>> I want to improve the existing fraktur (frk) model with some 6000 hand
>> curated lines from our library.
>>
>> Ground truth for these lines has 10 new unicode characters not present in
>> German fraktur model.
>>
>> How can I continue training from the existing German fraktur model
>> without full retraining?
>>
>> Progress so far:
>>
>> - Following information on https://github.com/tesseract-ocr/tesstrain
>> - My script created the .tif and gt.txt files based on examples
>>   provided in
>>   https://github.com/tesseract-ocr/tesstrain/blob/master/ocrd-testset.zip
>> - Now makefile
>>   https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile has
>>   space for START_MODEL
>>
>> What/if anything do I enter into START_MODEL?
>>
>> It would be fantastic to see an example CLI command used for your
>> incremental training. :)
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/1e79c1d6-de0c-4c87-b07c-9455b90cfef4%40googlegroups.com.
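
P.S. Since training relies on every line image having its transcription next to it, a quick pre-flight check of the ground-truth directory seems worthwhile before starting a long run. Below is a minimal sketch, assuming the layout described above (NAME.tif or NAME.png sitting next to NAME.gt.txt). The function name check_gt_pairs is my own invention and hypothetical; it is not part of tesstrain or Tesseract.

```shell
# Hypothetical helper, NOT part of tesstrain: report line images that have no
# (or an empty) transcription. Assumes NAME.tif / NAME.png pairs with NAME.gt.txt.
check_gt_pairs() {
    dir="$1"
    missing=0
    for img in "$dir"/*.tif "$dir"/*.png; do
        [ -e "$img" ] || continue            # skip glob patterns that matched nothing
        base="${img%.*}"                     # strip the image extension
        if [ ! -s "$base.gt.txt" ]; then     # transcription missing or zero bytes
            echo "no transcription for: $img"
            missing=$((missing + 1))
        fi
    done
    echo "images without transcription: $missing"
    [ "$missing" -eq 0 ]                     # non-zero exit status if anything is missing
}
```

In my case it would be called as check_gt_pairs ~/train/data_train_2020_1_28_16_49_54; the exit status is non-zero if any image lacks a transcription, so it could also gate the make command.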

