> tesseract 4.0.0-beta.1

This is quite old. I suggest you use the latest build.
Not sure if @stweil is actively watching this forum. You can post a question in the tesstrain repo.

On Wed, Jan 29, 2020 at 7:32 PM Val LNB <[email protected]> wrote:

> Thank you for the link!
>
> I found the following example:
> https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#finetuning-based-on-scriptfraktur
>
> Here are the instructions that I have figured out so far for fine-tuning an existing model:
>
> On Ubuntu 18.04, I first double-checked for the right packages:
> dpkg -s tesseract-ocr
> dpkg -s tesseract-ocr-frk (not used, as I actually grabbed the latest model from https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_best/ and placed it in ~/train/tessdata/script under the name Fraktur.traineddata)
> dpkg -s libtesseract-dev (unsure if this package is necessary, but I installed it a while ago)
>
> ~$ tesseract --version
> tesseract 4.0.0-beta.1
>
> git clone https://github.com/tesseract-ocr/tesstrain.git
>
> cd to the tesstrain directory
>
> Then start the training process with the following command:
>
> make -r training START_MODEL=Fraktur TESSDATA=~/train/tessdata/script GROUND_TRUTH_DIR=~/train/data_train_2020_1_28_16_49_54 MODEL_NAME=Frak_LV_J29
>
> So ~/train/tessdata/script/Fraktur.traineddata will be used as the starting model, while GROUND_TRUTH_DIR holds 6k pairs of .gt.txt and .tif files.
>
> Defaults: a 10,000-iteration run, and 10% of GROUND_TRUTH_DIR will be used for testing, assuming the wiki is correct.
>
> My only worry is that my .tif files apparently have no dpi information, so the default of 70 is used.
>
> Are the warnings about the lack of dpi a bad sign?
>
> Interestingly, .png files are used when running the training, so I could perhaps have skipped the conversion to .tif, since I started with .png! :)
>
> Now, the big question: how long will it take to run 10,000 iterations on an average 4-core Xeon v3 VM?
>
> On Tuesday, January 28, 2020 at 7:24:11 PM UTC+2, shree wrote:
>>
>> Please see https://github.com/tesseract-ocr/tesstrain/wiki
>>
>> There are already newly trained models by @stweil for Fraktur.
>>
>> On Tue, Jan 28, 2020, 22:46 Val LNB <[email protected]> wrote:
>>
>>> *How to perform incremental training on Tesseract 4.0+?*
>>>
>>> I want to improve the existing Fraktur (frk) model with some 6000 hand-curated lines from our library.
>>>
>>> The ground truth for these lines has 10 new Unicode characters not present in the German Fraktur model.
>>>
>>> How can I continue training from the existing German Fraktur model without full retraining?
>>>
>>> Progress so far:
>>>
>>> - Following the information on https://github.com/tesseract-ocr/tesstrain
>>> - My script created the .tif and .gt.txt files based on the examples provided in https://github.com/tesseract-ocr/tesstrain/blob/master/ocrd-testset.zip
>>> - Now the Makefile https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile has a START_MODEL variable
>>>
>>> What, if anything, do I enter into START_MODEL?
>>>
>>> It would be fantastic to see an example CLI command used for your incremental training. :)
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1e79c1d6-de0c-4c87-b07c-9455b90cfef4%40googlegroups.com

--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
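
Since the thread relies on GROUND_TRUTH_DIR holding matching pairs of line images and .gt.txt files, a quick sanity check before starting the run may save some time. The following is only a minimal sketch: the function name count_gt_pairs is hypothetical (it is not part of tesstrain), and it assumes each image and its transcription share the same base name, as in the .tif/.gt.txt pairs described above.

```shell
#!/bin/sh
# Hypothetical helper, not part of tesstrain.
# Counts line images (.tif or .png) in a directory that have a matching
# .gt.txt transcription with the same base name. Images without a
# transcription are reported on stderr; the pair count goes to stdout.
count_gt_pairs() {
    dir=$1
    pairs=0
    for img in "$dir"/*.tif "$dir"/*.png; do
        [ -e "$img" ] || continue      # skip glob patterns that matched nothing
        base=${img%.*}                 # strip the image extension
        if [ -f "$base.gt.txt" ]; then
            pairs=$((pairs + 1))
        else
            echo "missing transcription for $img" >&2
        fi
    done
    echo "$pairs"
}

# Example use with the directory from the thread:
#   count_gt_pairs ~/train/data_train_2020_1_28_16_49_54
```

If the reported count matches the expected number of lines (about 6k in this thread), the make command quoted above can then be run unchanged. Note that an image present as both .tif and .png would be counted twice by this sketch.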

