Hi Val,

How did you generate the 6k .gt.txt files from the .tif files?
Thank you.

On Wednesday, 29 January 2020 14:02:40 UTC, Val LNB wrote:
>
> Thank you for the link!
>
> I found the following example:
> https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#finetuning-based-on-scriptfraktur
>
> Here are the instructions I have figured out so far for fine-tuning an
> existing model:
>
> On Ubuntu 18.04, I first double-checked for the right packages:
>
> dpkg -s tesseract-ocr
> dpkg -s tesseract-ocr-frk (not used, as I actually grabbed the latest model from
> https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_best/
> and placed it in ~/train/tessdata/script under the name Fraktur.traineddata)
> dpkg -s libtesseract-dev (unsure whether this package is necessary, but I
> installed it a while ago)
>
> ~$ tesseract --version
> tesseract 4.0.0-beta.1
>
> git clone https://github.com/tesseract-ocr/tesstrain.git
>
> cd to the tesstrain directory
>
> Then start the training process with the following command:
>
> make -r training START_MODEL=Fraktur TESSDATA=~/train/tessdata/script \
>   GROUND_TRUTH_DIR=~/train/data_train_2020_1_28_16_49_54 \
>   MODEL_NAME=Frak_LV_J29
>
> So ~/train/tessdata/script/Fraktur.traineddata will be used as the starting
> model, while GROUND_TRUTH_DIR holds 6k pairs of .gt.txt and .tif files.
>
> Defaults: a 10,000 epoch run, and 10% of GROUND_TRUTH_DIR will be used for
> testing, assuming the wiki is correct.
>
> My only worry is that my .tif files apparently have no dpi information, so
> the default of 70 is used.
>
> Are the warnings about lack of dpi a bad sign?
>
> Interestingly, .png files are used when running training, so I could perhaps
> have skipped the conversion to .tif, since I started with .png! :)
>
> Now, the big question: how long will it take to run 10,000 epochs on an
> average 4 core Xeon v3 VM?
>
> On Tuesday, January 28, 2020 at 7:24:11 PM UTC+2, shree wrote:
>>
>> Please see https://github.com/tesseract-ocr/tesstrain/wiki
>>
>> There are already newly trained models by @stweil for Fraktur.
>>
>> On Tue, Jan 28, 2020, 22:46 Val LNB <[email protected]> wrote:
>>
>>> *How to perform incremental training on Tesseract 4.0+?*
>>>
>>> I want to improve the existing fraktur (frk) model with some 6000
>>> hand-curated lines from our library.
>>>
>>> The ground truth for these lines has 10 new unicode characters not present
>>> in the German fraktur model.
>>>
>>> How can I continue training from the existing German fraktur model
>>> without full retraining?
>>>
>>> Progress so far:
>>>
>>> - Following the information on https://github.com/tesseract-ocr/tesstrain
>>> - My script created the .tif and .gt.txt files based on the examples
>>>   provided in
>>>   https://github.com/tesseract-ocr/tesstrain/blob/master/ocrd-testset.zip
>>> - Now the makefile
>>>   https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile has
>>>   space for START_MODEL
>>>
>>> What, if anything, do I enter into START_MODEL?
>>>
>>> It would be fantastic to see an example CLI command used for your
>>> incremental training. :)
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/1e79c1d6-de0c-4c87-b07c-9455b90cfef4%40googlegroups.com
>>> .
>>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e40272f2-5ad9-4736-bd22-cd39c6470749%40googlegroups.com.
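
For anyone preparing a GROUND_TRUTH_DIR like the one described in this thread, here is a rough sanity check that might be worth running before starting a long training job. It is only a sketch under stated assumptions: each line image is named NAME.tif with its transcription in NAME.gt.txt (as in Val's description of the directory), and the 90/10 train/test split is the figure Val quotes from the wiki, not something verified here. The function names find_unpaired and split_counts are made up for illustration and are not part of tesstrain.

```python
from pathlib import Path

GT_SUFFIX = ".gt.txt"


def find_unpaired(ground_truth_dir, image_suffix=".tif"):
    """Compare line images against their .gt.txt transcriptions.

    Returns three sorted lists of base names:
    (paired, images_without_text, texts_without_image).
    """
    root = Path(ground_truth_dir)
    # Strip the suffix by length so names containing dots are handled correctly.
    images = {p.name[: -len(image_suffix)] for p in root.glob(f"*{image_suffix}")}
    texts = {p.name[: -len(GT_SUFFIX)] for p in root.glob(f"*{GT_SUFFIX}")}
    return sorted(images & texts), sorted(images - texts), sorted(texts - images)


def split_counts(n_pairs, train_percent=90):
    """Estimate how many pairs go to training and to testing.

    train_percent=90 is an assumption taken from the thread
    ("10% ... will be used for testing"), not a verified default.
    """
    n_train = n_pairs * train_percent // 100
    return n_train, n_pairs - n_train
```

With the 6k pairs mentioned above, split_counts(6000) returns (5400, 600). Calling find_unpaired("~/train/data_train_2020_1_28_16_49_54") would need the "~" expanded first (for example with Path.expanduser()), and any names in the second or third list point at files that have no partner.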

