You can have a look at ocrd-train

https://github.com/OCR-D/ocrd-train

You just have to prepare cropped TIFF images and text files with the same
base name, each text file containing a single line of text.
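As a sketch of that pairing convention, a small script can verify that every line image has a transcription with the same base name. The `.gt.txt` extension is an assumption here (it depends on the ocrd-train version; check the repository's README for the exact naming):

```python
# Check that every cropped line image has a matching single-line
# transcription file with the same base name (foo.tif <-> foo.gt.txt).
# The .gt.txt extension is an assumption; adjust to your ocrd-train setup.
from pathlib import Path

def check_pairs(gt_dir):
    """Return (paired, orphans): tif files with/without a matching txt."""
    gt_dir = Path(gt_dir)
    paired, orphans = [], []
    for tif in sorted(gt_dir.glob("*.tif")):
        txt = tif.parent / (tif.stem + ".gt.txt")
        if txt.exists():
            paired.append((tif, txt))
        else:
            orphans.append(tif)
    return paired, orphans
```

Running this before training catches images that would silently be skipped for lack of ground truth.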

At the same time, if you have already set up everything for the font-based
training, I'd give it a try (time permitting): you get something working
today, and you can compare the different methods, etc.



Lorenzo

On Thu, Jan 31, 2019 at 11:35 AM Daniel Ferenc <[email protected]>
wrote:

> Is there a guide somewhere on how to set up training like this? How to
> pair the images and text, etc.? And thank you for the insight, it really
> is helpful.
>
> On Thursday, January 31, 2019 at 11:18:35 AM UTC+1, Lorenzo Blz wrote:
>>
>> Yes, generating text is faster and easier.
>>
>> But the real, extracted and cleaned text you will eventually recognize is
>> going to be different from this, more or less depending on a lot of
>> factors:
>> - how similar your training font actually is
>> - how good your cleanup will be (test this in advance)
>> - differences in text size, borders, rotation, and shearing relative to
>> the generated text (for example, you train with a 0px border and later
>> provide text with a 4px border).
>>
>> Using the real data, in general, should be better, unless you have very
>> little data.
>>
>> If the real images differ from the generated ones, you may try adding
>> some corruption that mimics the real images before training: noise,
>> perspective deformations, small rotations, etc.
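A minimal sketch of such corruption, assuming grayscale line images as 2-D uint8 NumPy arrays. The noise level and border sizes below are invented values to be tuned until the corrupted samples resemble your real scans:

```python
# Corrupt a clean generated line image so it looks more like a real scan:
# random border padding (mimics inconsistent cropping) plus Gaussian noise.
# noise_std and max_border are made-up defaults; tune against real data.
import numpy as np

def corrupt(img, noise_std=8.0, max_border=4, rng=None):
    """Add a random white border and Gaussian noise to a line image."""
    rng = np.random.default_rng(rng)
    # random border on each side (white background = 255)
    t, b, l, r = rng.integers(0, max_border + 1, size=4)
    img = np.pad(img, ((t, b), (l, r)), constant_values=255)
    # additive Gaussian noise, clipped back to the valid pixel range
    noisy = img.astype(np.float32) + rng.normal(0.0, noise_std, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Small rotations and perspective warps would be added the same way, applied before the images are fed to tesstrain.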
>>
>> And/or you can try to mix real and generated samples in the training.
>>
>> You say 90% of the samples are easy to process: those can be enough if
>> you can isolate them easily. Consider that real-life samples will not be
>> much better than these (I suppose).
>>
>> As for the rotations, you can do perspective correction with OpenCV's
>> findHomography or with Hough lines.
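The four-point math behind that correction can be sketched in plain NumPy via the direct linear transform. In a real pipeline you would call `cv2.findHomography` and `cv2.warpPerspective` on corner points detected from the card edges; the coordinates below are invented for illustration:

```python
# Solve the 3x3 homography H mapping 4 source points to 4 destination
# points (DLT): each correspondence gives two linear equations in the 9
# entries of H, and h is the null-space vector of the stacked system.
import numpy as np

def homography(src, dst):
    """Return H (3x3) such that dst ~ H * src for the 4 point pairs."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # null-space vector of A = last row of V^T from the SVD
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    return vt[-1].reshape(3, 3)

def apply_h(H, pt):
    """Map a 2-D point through H in homogeneous coordinates."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return (x / w, y / w)
```

Mapping the four detected card corners onto an axis-aligned rectangle this way undoes both the rotation and the perspective distortion in one step.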
>>
>> I realize this is A LOT of work as I'm doing this right now.
>>
>> If you have time, try different ways and see what works best.
>>
>>
>>
>> Bye
>>
>> Lorenzo
>>
>>
>> On Wed, Jan 30, 2019 at 4:08 PM Daniel Ferenc <[email protected]>
>> wrote:
>>
>>> I'm not sure exactly how I would set that up (regarding Tesseract
>>> training), BUT there are about 44,000 (English) cards at this time, and a
>>> high-resolution image of each is about 2 MB (at least from the resource I
>>> can get them from). Also, not every card has the same layout, so a generic
>>> crop function would not work. Over 90% of the cards would be OK like this,
>>> but the rest would cause issues. It's easier for me to try to teach
>>> Tesseract this way and then have the software try different rotations/crops
>>> if the default one doesn't return anything meaningful in terms of OCR. Just
>>> preparing the images for this is a massive task, while retrieving the word
>>> list from the database took about 20 seconds, plus a minute to download the
>>> fonts and ~4 hours of training, for a result that will hopefully be good
>>> enough.
>>>
>>> On Wednesday, January 30, 2019 at 3:53:43 PM UTC+1, Lorenzo Blz wrote:
>>>>
>>>>
>>>> If you have images of the cards with the corresponding text you could
>>>> train it on the cropped/cleaned text directly.
>>>>
>>>> On Wed, Jan 30, 2019 at 3:41 PM Daniel Ferenc
>>>> <[email protected]> wrote:
>>>>
>>>>> So, I have figured out what I was doing wrong:
>>>>>
>>>>> - I am using the tesseract packages I got from apt on Ubuntu 18.04 LTS,
>>>>> and they were obviously missing some langdata, which I downloaded from
>>>>> the repository
>>>>> - I also needed to get the Latin.unicharset file
>>>>> - And finally, I didn't notice an error in one of the late steps saying
>>>>> that radical-stroke.txt was missing, which resulted in no traineddata
>>>>> being generated by my tesstrain.sh run
>>>>> - And since the last step required a traineddata file and I didn't have
>>>>> one, I used the eng.traineddata that came with the package, which
>>>>> resulted in very poor recognition performance
>>>>>
>>>>> At this moment I'm running the training with a word list of ~13,600
>>>>> possible words and ~100 fonts that can be used... Waiting for 175,000
>>>>> iterations to finish, because at 150k I still had an error rate of
>>>>> ~2.4
>>>>>
>>>>> (I'm creating a piece of software that should recognize Magic: The
>>>>> Gathering card names. I have a database of all currently existing
>>>>> (English) cards and created a word list of the unique words that can
>>>>> appear in their names, and I am training Tesseract on these words with
>>>>> all the fonts that were ever used for these cards. I will let you know
>>>>> how this worked out once the training is done.)
>>>>>
>>>>> Thank you for your support.
>>>>>
>>>>> On Tuesday, January 29, 2019 at 6:40:14 PM UTC+1, shree wrote:
>>>>>>
>>>>>> Fine-tune with your specific font - see the example below, which uses
>>>>>> the Impact font.
>>>>>>
>>>>>> #!/bin/bash
>>>>>>
>>>>>> time ~/tesseract/src/training/tesstrain.sh \
>>>>>>   --fonts_dir /usr/share/fonts \
>>>>>>   --lang eng --linedata_only \
>>>>>>   --noextract_font_properties \
>>>>>>   --langdata_dir ~/langdata \
>>>>>>   --tessdata_dir ~/tessdata \
>>>>>>   --fontlist "Impact Condensed" \
>>>>>>   --training_text ~/langdata/eng/eng.training_text \
>>>>>>   --workspace_dir ~/tmp/ \
>>>>>>   --save_box_tiff \
>>>>>>   --output_dir ~/tesstutorial/engtrainfont
>>>>>>
>>>>>> time ~/tesseract/src/training/tesstrain.sh \
>>>>>>   --fonts_dir /usr/share/fonts \
>>>>>>   --lang eng --linedata_only \
>>>>>>   --noextract_font_properties \
>>>>>>   --langdata_dir ~/langdata \
>>>>>>   --tessdata_dir ~/tessdata \
>>>>>>   --fontlist "Impact Condensed" \
>>>>>>   --training_text ~/langdata/eng/eng.mywordlist.training_text \
>>>>>>   --workspace_dir ~/tmp/ \
>>>>>>   --save_box_tiff \
>>>>>>   --output_dir ~/tesstutorial/engevalwordlist
>>>>>>
>>>>>> #
>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact
>>>>>>
>>>>>> echo -e "\n****** Finetune one of the fully-trained existing models ******"
>>>>>>
>>>>>> mkdir -p ~/tesstutorial/impact_from_full
>>>>>>
>>>>>> combine_tessdata -e ~/tessdata_best/eng.traineddata \
>>>>>>   ~/tesstutorial/impact_from_full/eng.lstm
>>>>>>
>>>>>> time ~/tesseract/src/training/lstmtraining \
>>>>>>   --model_output ~/tesstutorial/impact_from_full/impact \
>>>>>>   --continue_from ~/tesstutorial/impact_from_full/eng.lstm \
>>>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>>>   --train_listfile ~/tesstutorial/engtrainfont/eng.training_files.txt \
>>>>>>   --debug_interval -1 \
>>>>>>   --max_iterations 400
>>>>>>
>>>>>> echo -e "\n*********** eval on training data ******\n"
>>>>>>
>>>>>> time ~/tesseract/src/training/lstmeval \
>>>>>>   --model ~/tesstutorial/impact_from_full/impact_checkpoint \
>>>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>>>   --eval_listfile ~/tesstutorial/engtrainfont/eng.training_files.txt
>>>>>>
>>>>>> echo -e "\n***********eval on eval data ******\n"
>>>>>>
>>>>>> time ~/tesseract/src/training/lstmeval \
>>>>>>   --model ~/tesstutorial/impact_from_full/impact_checkpoint \
>>>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>>>   --eval_listfile ~/tesstutorial/engevalwordlist/eng.training_files.txt
>>>>>>
>>>>>> echo -e "\n*********** convert to traineddata  ******\n"
>>>>>>
>>>>>> time ~/tesseract/src/training/lstmtraining \
>>>>>>   --stop_training \
>>>>>>   --continue_from ~/tesstutorial/impact_from_full/impact_checkpoint \
>>>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>>>   --model_output ~/tesstutorial/engtrainfont/eng.traineddata
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 28, 2019 at 9:37 PM Daniel Ferenc <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I need to train Tesseract on only a specific word list (about 13,600
>>>>>>> words) and one specific font. I tried following the training tutorial
>>>>>>> on the wiki, but I'm not sure if I'm doing anything wrong - the
>>>>>>> traineddata file is about 7 MB, and I combined it with eng.traineddata
>>>>>>> just to get any traineddata file, because after finishing the training
>>>>>>> I had no traineddata file at all. Can anyone please help me?
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1909bad8-d28d-4660-812d-47d0310e67c2%40googlegroups.com
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1909bad8-d28d-4660-812d-47d0310e67c2%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> ____________________________________________________________
>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>

