tesseract 4.0.0-beta.1

That version is quite old. I suggest you use the latest build.

I am not sure whether @stweil is actively watching this forum. You can
post a question in the tesstrain repo.
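
To see which version is installed before and after upgrading, something like
the following could help. It is only a sketch: the 4.1.0 threshold and the
helper name version_at_least are my own choices for illustration, not a
documented tesstrain requirement, and sort -V assumes GNU coreutils.

```sh
#!/bin/sh
# Sketch: compare the installed tesseract version against a chosen minimum.
# MIN_VERSION and version_at_least are illustrative assumptions; tesstrain
# itself does not define them.
MIN_VERSION="4.1.0"

# Succeeds when version $1 is >= version $2 (relies on GNU sort -V).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v tesseract >/dev/null 2>&1; then
    # The first line of the output looks like: tesseract 4.0.0-beta.1
    installed=$(tesseract --version 2>&1 | awk 'NR==1 {print $2}')
    if version_at_least "$installed" "$MIN_VERSION"; then
        echo "tesseract $installed looks recent enough"
    else
        echo "tesseract $installed is older than $MIN_VERSION, consider upgrading"
    fi
else
    echo "tesseract not found in PATH"
fi
```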



On Wed, Jan 29, 2020 at 7:32 PM Val LNB <[email protected]> wrote:

> Thank you for the link!
>
> I found the following example:
> https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#finetuning-based-on-scriptfraktur
>
> Here are instructions that I have figured out so far for fine-tuning an
> existing model:
>
> On Ubuntu 18.04, I first double-checked that the right packages were
> installed:
> dpkg -s tesseract-ocr
> dpkg -s tesseract-ocr-frk (not used, as I actually grabbed the latest model
> from
> https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_best/
> and placed it in ~/train/tessdata/script under the name Fraktur.traineddata)
> dpkg -s libtesseract-dev (unsure if this package is necessary, but I
> installed it a while ago)
>
> ~$ tesseract --version
> tesseract 4.0.0-beta.1
>
> git clone https://github.com/tesseract-ocr/tesstrain.git
>
> cd to tesstrain directory
>
> Then start the training process with the following command:
>
> make -r training START_MODEL=Fraktur TESSDATA=~/train/tessdata/script \
>   GROUND_TRUTH_DIR=~/train/data_train_2020_1_28_16_49_54 \
>   MODEL_NAME=Frak_LV_J29
>
> So ~/train/tessdata/script/Fraktur.traineddata will be used as the starting
> model, while GROUND_TRUTH_DIR holds 6k pairs of .gt.txt and .tif files.
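
Before starting a long run it may be worth confirming that every line image
really has a transcription next to it. A minimal sketch, assuming the pairs
are named name.tif and name.gt.txt as described; check_pairs is a hypothetical
helper of mine, not something tesstrain provides:

```sh
#!/bin/sh
# Sketch: report .tif line images without a matching, non-empty .gt.txt file.
# check_pairs is a hypothetical helper, not part of tesstrain.
check_pairs() {
    dir=$1
    missing=0
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue        # the glob matched nothing
        gt="${img%.tif}.gt.txt"
        if [ ! -s "$gt" ]; then          # transcription absent or empty
            echo "no ground truth for: $img"
            missing=$((missing + 1))
        fi
    done
    echo "missing: $missing"
    [ "$missing" -eq 0 ]                 # exit status reflects the result
}

# Example, using the directory from the make command quoted in this message:
# check_pairs ~/train/data_train_2020_1_28_16_49_54
```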
>
> Defaults: a 10,000 epoch run, with 10% of GROUND_TRUTH_DIR used for
> testing, assuming the wiki is correct.
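
As far as I can tell, these two defaults correspond to the MAX_ITERATIONS and
RATIO_TRAIN variables in the tesstrain Makefile (10000 and 0.90 when I last
looked; please verify in your checkout). As I understand it, lstmtraining
counts iterations, roughly one text line each, so 10,000 iterations is not the
same thing as 10,000 epochs. The sketch below only illustrates the split
arithmetic; split_counts is a hypothetical helper, and the ratio is given in
percent to stay within integer shell arithmetic.

```sh
#!/bin/sh
# Sketch: how many line images land in training vs. evaluation for a ratio.
# split_counts is a hypothetical helper, not part of tesstrain.
split_counts() {
    total=$1
    train_percent=$2
    train=$((total * train_percent / 100))
    echo "$train $((total - train))"
}

split_counts 6000 90    # prints: 5400 600

# Spelling the defaults out explicitly (variable names assumed, see above):
# make -r training START_MODEL=Fraktur TESSDATA=~/train/tessdata/script \
#   GROUND_TRUTH_DIR=~/train/data_train_2020_1_28_16_49_54 \
#   MODEL_NAME=Frak_LV_J29 MAX_ITERATIONS=10000 RATIO_TRAIN=0.90
```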
>
> My only worry is that my .tif files apparently have no DPI information, so
> the default of 70 is used.
>
> Are the warnings about the missing DPI a bad sign?
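
I cannot say for certain whether the warning matters for training. If you
would rather silence it, one option is to write a resolution into the image
metadata with ImageMagick, which does not resample the pixels. This is a
sketch under assumptions: set_dpi is a hypothetical helper, 300 is a guessed
value rather than a measured one, and the mogrify command name assumes
ImageMagick 6 as shipped with Ubuntu 18.04.

```sh
#!/bin/sh
# Sketch: store a resolution in image metadata without resampling the pixels.
# set_dpi is a hypothetical helper; the DPI value passed in is an assumption.
# mogrify overwrites files in place, so work on a copy of the data.
# Set DRY_RUN=1 to print the command instead of running it.
set_dpi() {
    dpi=$1
    shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "mogrify -units PixelsPerInch -density $dpi $*"
    else
        mogrify -units PixelsPerInch -density "$dpi" "$@"
    fi
}

# Example: DRY_RUN=1 set_dpi 300 line_0001.tif
```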
>
>
> Interestingly, .png files are used when running training, so I could
> perhaps have skipped the conversion to .tif, since I started with .png! :)
>
> Now, the big question: how long will it take to run 10,000 epochs on an
> average 4-core Xeon v3 VM?
>
> On Tuesday, January 28, 2020 at 7:24:11 PM UTC+2, shree wrote:
>>
>> Please see https://github.com/tesseract-ocr/tesstrain/wiki
>>
>> There are already newly trained models by @stweil for Fraktur.
>>
>> On Tue, Jan 28, 2020, 22:46 Val LNB <[email protected]> wrote:
>>
>>> *How to perform incremental training on Tesseract 4.0+?*
>>>
>>>
>>> I want to improve the existing Fraktur (frk) model with some 6000
>>> hand-curated lines from our library.
>>>
>>> The ground truth for these lines has 10 new Unicode characters not
>>> present in the German Fraktur model.
>>>
>>>
>>> How can I continue training from the existing German fraktur model
>>> without full retraining?
>>>
>>>
>>> Progress so far:
>>>
>>>
>>>    - I am following the information on https://github.com/tesseract-ocr/tesstrain
>>>    - My script created the .tif and .gt.txt files based on the examples
>>>    provided in
>>>    https://github.com/tesseract-ocr/tesstrain/blob/master/ocrd-testset.zip
>>>    - The Makefile
>>>    https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile has
>>>    a START_MODEL variable
>>>
>>>
>>> What, if anything, do I enter into START_MODEL?
>>>
>>>
>>> It would be fantastic to see an example CLI command used for your
>>> incremental training. :)
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/1e79c1d6-de0c-4c87-b07c-9455b90cfef4%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/1e79c1d6-de0c-4c87-b07c-9455b90cfef4%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>


-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
