Hi Val,

How did you generate the 6k .gt.txt files from the tif files?

Thank you.

On Wednesday, 29 January 2020 14:02:40 UTC, Val LNB wrote:
>
> Thank you for the link!
>
> I found the following example: 
> https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#finetuning-based-on-scriptfraktur
>
> Here are instructions that I have figured out so far for fine-tuning an 
> existing model:
>
> On Ubuntu 18.04, I first double-checked that the right packages were installed:
> dpkg -s tesseract-ocr
> dpkg -s tesseract-ocr-frk (not used, as I actually grabbed the latest model 
> from 
> https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_best/
> and placed it in ~/train/tessdata/script under the name Fraktur.traineddata)
> dpkg -s libtesseract-dev (unsure if this package is necessary but I 
> installed it a while ago)
>
> ~$ tesseract --version
> tesseract 4.0.0-beta.1
>
> git clone https://github.com/tesseract-ocr/tesstrain.git
>
> cd to tesstrain directory 
>
> Then start the training process with the following command:
>
> make -r training START_MODEL=Fraktur TESSDATA=~/train/tessdata/script 
> GROUND_TRUTH_DIR=~/train/data_train_2020_1_28_16_49_54 
> MODEL_NAME=Frak_LV_J29
>
> So ~/train/tessdata/script/Fraktur.traineddata will be used as the starting
> model, while GROUND_TRUTH_DIR holds 6k pairs of .gt.txt and .tif files.
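
Since the run depends on each image having its transcription next to it, it may be worth checking the pairs before starting make. This is only a minimal sketch: the function name missing_gt is made up for illustration, and it assumes the <name>.tif / <name>.gt.txt naming described above (it also looks at .png, since those are mentioned further down).

```shell
# List image files in a ground-truth directory that lack a matching .gt.txt.
# Usage: missing_gt DIR   (DIR is whatever is passed as GROUND_TRUTH_DIR)
missing_gt() {
  gt_dir=$1
  for img in "$gt_dir"/*.tif "$gt_dir"/*.png; do
    # An unmatched glob stays literal and names no real file, so skip it.
    [ -e "$img" ] || continue
    # Strip the image extension and look for the transcription beside it.
    [ -f "${img%.*}.gt.txt" ] || printf '%s\n' "${img##*/}"
  done
}
```

For example, missing_gt ~/train/data_train_2020_1_28_16_49_54 prints nothing when every image has its .gt.txt, and otherwise prints one file name per incomplete pair.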
>
> Defaults: a 10,000-iteration run (MAX_ITERATIONS), with 10% of 
> GROUND_TRUTH_DIR held out for evaluation, assuming the wiki is correct.
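
To put numbers on that split for 6k lines, here is a small worked example. It is a sketch under the assumption that the 90/10 ratio quoted above applies and that plain integer arithmetic is close enough; the function name split_counts is hypothetical, and the Makefile may round differently.

```shell
# Estimate how many ground-truth lines go to training vs. evaluation.
# Usage: split_counts TOTAL [TRAIN_PERCENT]
split_counts() {
  total=$1
  train_pct=${2:-90}   # assumed 90% training / 10% evaluation
  train=$(( total * train_pct / 100 ))
  eval_n=$(( total - train ))
  echo "train=$train eval=$eval_n"
}

split_counts 6000   # prints: train=5400 eval=600
```

So roughly 5400 lines would be used for training and 600 held back, if the defaults are as described.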
>
> My only worry is that my .tif files apparently have no DPI information, so 
> the default of 70 is used.
>
> Are the warnings about the missing DPI a bad sign?
>
>
> Interestingly, .png files are used when running training, so I could 
> perhaps have skipped the conversion to .tif, since I started with .png! :)
>
> Now, the big question: how long will it take to run 10,000 iterations on 
> an average 4-core Xeon v3 VM?
>
> On Tuesday, January 28, 2020 at 7:24:11 PM UTC+2, shree wrote:
>>
>> Please see https://github.com/tesseract-ocr/tesstrain/wiki
>>
>> There are already newly trained models by @stweil for Fraktur.
>>
>> On Tue, Jan 28, 2020, 22:46 Val LNB <[email protected]> wrote:
>>
>>> *How to perform incremental training on Tesseract 4.0+?*
>>>
>>>
>>> I want to improve the existing Fraktur (frk) model with some 6000 
>>> hand-curated lines from our library. 
>>>
>>> Ground truth for these lines has 10 new Unicode characters not present 
>>> in the German Fraktur model.
>>>
>>>
>>> How can I continue training from the existing German Fraktur model 
>>> without full retraining?
>>>
>>>
>>> Progress so far:
>>>
>>>
>>>    - Following the information on https://github.com/tesseract-ocr/tesstrain
>>>    - My script created the .tif and .gt.txt files based on the examples 
>>>    provided in 
>>>    https://github.com/tesseract-ocr/tesstrain/blob/master/ocrd-testset.zip
>>>    - The Makefile 
>>>    https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile has 
>>>    a START_MODEL variable
>>>
>>>
>>> What, if anything, do I enter into START_MODEL?
>>>
>>>
>>> It would be fantastic to see an example CLI command used for your 
>>> incremental training. :)
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/1e79c1d6-de0c-4c87-b07c-9455b90cfef4%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/tesseract-ocr/1e79c1d6-de0c-4c87-b07c-9455b90cfef4%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>
