You can have a look at ocrd-train https://github.com/OCR-D/ocrd-train
You just have to prepare cropped TIFF images and txt files with the same
name, each pair containing a single line of text.

That said, if you have already set everything up for the font-based
training, I'd give it a try as well (time permitting): you get something
working today, and you can compare the different methods.

Lorenzo

On Thu, Jan 31, 2019 at 11:35, Daniel Ferenc <[email protected]> wrote:

> Is there a guide somewhere on how to set up training like this? How to pair
> the images and text, etc.? And thank you for the insight, it really is
> helpful.
>
> On Thursday, January 31, 2019 at 11:18:35 AM UTC+1, Lorenzo Blz wrote:
>>
>> Yes, generating text is faster and easier.
>>
>> But the real extracted and cleaned text you are eventually going to
>> recognize will differ from the generated text, by an amount that depends
>> on a lot of factors:
>> - how similar your training font actually is
>> - how good your cleanup will be (test this in advance)
>> - differences in text size, border, rotation and shearing from the
>> generated text (for example, you train with a 0px border and later
>> provide text with a 4px border).
>>
>> Using the real data, in general, should be better, unless you have very
>> little data.
>>
>> If the real images differ from the generated ones, you may try to add some
>> corruption mimicking the real images before the training: noise,
>> perspective deformations, small rotations, etc.
>>
>> And/or you can try to mix real and generated samples in the training.
>>
>> You say 90% of the samples are easy to process: these can be enough if
>> you can isolate them easily. Consider that real-life samples will not be
>> much better than these (I suppose).
>>
>> About the rotations: you can do perspective correction with OpenCV
>> findHomography or with Hough lines.
>>
>> I realize this is A LOT of work, as I'm doing this right now.
>>
>> If you have time, try different ways and see what works best.
>>
>> Bye
>>
>> Lorenzo
>>
>> On Wed, Jan 30, 2019 at 16:08, Daniel Ferenc <[email protected]> wrote:
>>
>>> I'm not sure exactly how I would set that up (regarding tesseract
>>> training), BUT there are about 44000 (English) cards at this time and a
>>> high-resolution image of each is about 2 MB (at least from the source I
>>> can get them from). Also, not every card has the same format, so a
>>> generic crop function would not work. Over 90% of the cards would be OK
>>> like this, but the rest would cause issues. It's easier for me to try to
>>> teach tesseract this way and then have the software try different
>>> rotations/crops if the default one doesn't return anything meaningful
>>> from OCR. Just preparing the images for this is a massive task, while
>>> retrieving the word list from the database took about 20 seconds, plus
>>> a minute to download the fonts and ~4 hours of training, for a result
>>> that will hopefully be good enough.
>>>
>>> On Wednesday, January 30, 2019 at 3:53:43 PM UTC+1, Lorenzo Blz wrote:
>>>>
>>>> If you have images of the cards with the corresponding text you could
>>>> train it on the cropped/cleaned text directly.
>>>>
>>>> On Wed, Jan 30, 2019 at 15:41, Daniel Ferenc <[email protected]> wrote:
>>>>
>>>>> So, I have figured out what I was doing wrong:
>>>>>
>>>>> - I am using the tesseract packages I got from apt on Ubuntu 18.04 LTS,
>>>>> and they were obviously missing some langdata, which I downloaded from
>>>>> the repository
>>>>> - I also needed to get the Latin.unicharset file
>>>>> - And finally, I didn't notice an error in one of the late steps saying
>>>>> that radical-stroke.txt was missing, which resulted in no traineddata
>>>>> being generated by my tesstrain.sh run
>>>>> - And since the last step required a traineddata file and I didn't have
>>>>> one, I used the eng.traineddata that came with the package, and it all
>>>>> resulted in very poor recognition performance
>>>>>
>>>>> At this moment I'm running the training with a wordlist of ~13600
>>>>> possible words and the ~100 fonts that can be used... Waiting for
>>>>> 175000 iterations to finish, because at 150k I still had an error rate
>>>>> of ~2.4
>>>>>
>>>>> (I'm creating a piece of software that should recognize Magic: the
>>>>> Gathering card names. I have a database of all currently existing cards
>>>>> (English ones) and created a word list of the unique words that can
>>>>> appear in their names, and I am training tesseract on these words with
>>>>> all the fonts that were ever used for these cards. I will let you know
>>>>> how this worked out once the training is done.)
>>>>>
>>>>> Thank you for your support.
>>>>>
>>>>> On Tuesday, January 29, 2019 at 6:40:14 PM UTC+1, shree wrote:
>>>>>>
>>>>>> Finetune with your specific font - see e.g. below, which uses the
>>>>>> IMPACT font.
>>>>>>
>>>>>> #!/bin/bash
>>>>>>
>>>>>> # Generate training line data rendered in the target font.
>>>>>> time ~/tesseract/src/training/tesstrain.sh \
>>>>>>   --fonts_dir /usr/share/fonts \
>>>>>>   --lang eng --linedata_only \
>>>>>>   --noextract_font_properties \
>>>>>>   --langdata_dir ~/langdata \
>>>>>>   --tessdata_dir ~/tessdata \
>>>>>>   --fontlist "Impact Condensed" \
>>>>>>   --training_text ~/langdata/eng/eng.training_text \
>>>>>>   --workspace_dir ~/tmp/ \
>>>>>>   --save_box_tiff \
>>>>>>   --output_dir ~/tesstutorial/engtrainfont
>>>>>>
>>>>>> # Generate evaluation line data from the custom word list.
>>>>>> time ~/tesseract/src/training/tesstrain.sh \
>>>>>>   --fonts_dir /usr/share/fonts \
>>>>>>   --lang eng --linedata_only \
>>>>>>   --noextract_font_properties \
>>>>>>   --langdata_dir ~/langdata \
>>>>>>   --tessdata_dir ~/tessdata \
>>>>>>   --fontlist "Impact Condensed" \
>>>>>>   --training_text ~/langdata/eng/eng.mywordlist.training_text \
>>>>>>   --workspace_dir ~/tmp/ \
>>>>>>   --save_box_tiff \
>>>>>>   --output_dir ~/tesstutorial/engevalwordlist
>>>>>>
>>>>>> # https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact
>>>>>>
>>>>>> echo -e "\n****** Finetune one of the fully-trained existing models: ***********\n"
>>>>>>
>>>>>> mkdir -p ~/tesstutorial/impact_from_full
>>>>>>
>>>>>> # Extract the LSTM model from the existing traineddata.
>>>>>> combine_tessdata -e ~/tessdata_best/eng.traineddata \
>>>>>>   ~/tesstutorial/impact_from_full/eng.lstm
>>>>>>
>>>>>> time ~/tesseract/src/training/lstmtraining \
>>>>>>   --model_output ~/tesstutorial/impact_from_full/impact \
>>>>>>   --continue_from ~/tesstutorial/impact_from_full/eng.lstm \
>>>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>>>   --train_listfile ~/tesstutorial/engtrainfont/eng.training_files.txt \
>>>>>>   --debug_interval -1 \
>>>>>>   --max_iterations 400
>>>>>>
>>>>>> echo -e "\n*********** eval on training data ******\n"
>>>>>>
>>>>>> time ~/tesseract/src/training/lstmeval \
>>>>>>   --model ~/tesstutorial/impact_from_full/impact_checkpoint \
>>>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>>>   --eval_listfile ~/tesstutorial/engtrainfont/eng.training_files.txt
>>>>>>
>>>>>> echo -e "\n*********** eval on eval data ******\n"
>>>>>>
>>>>>> time ~/tesseract/src/training/lstmeval \
>>>>>>   --model ~/tesstutorial/impact_from_full/impact_checkpoint \
>>>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>>>   --eval_listfile ~/tesstutorial/engevalwordlist/eng.training_files.txt
>>>>>>
>>>>>> echo -e "\n*********** convert to traineddata ******\n"
>>>>>>
>>>>>> time ~/tesseract/src/training/lstmtraining \
>>>>>>   --stop_training \
>>>>>>   --continue_from ~/tesstutorial/impact_from_full/impact_checkpoint \
>>>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>>>   --model_output ~/tesstutorial/engtrainfont/eng.traineddata
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 28, 2019 at 9:37 PM Daniel Ferenc <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I need to train Tesseract for only a specific wordlist (about 13600
>>>>>>> words) and one specific font. I tried following the training tutorial
>>>>>>> on the Wiki, but I'm not sure if I'm doing anything wrong - the
>>>>>>> traineddata file is about 7 megabytes, and I combined it with the
>>>>>>> eng.traineddata to get any traineddata file, because after finishing
>>>>>>> the training I had no traineddata file at all. Can anyone please help
>>>>>>> me?
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1909bad8-d28d-4660-812d-47d0310e67c2%40googlegroups.com
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> ____________________________________________________________
>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
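
P.S. On the image/text pairing mentioned at the top of this message: below is a minimal sketch for checking a folder of cropped line images before training. It is not taken from ocrd-train itself. It assumes the images end in .tif and each transcription is named like its image with .tif replaced by .gt.txt; that suffix is my assumption, so check the ocrd-train README for the exact naming it expects and pass a different suffix as the second argument if needed. The function name check_pairs is made up for this example.

```shell
#!/bin/bash
# Sketch: report line images that lack a usable single-line transcription.
# ASSUMPTION: images are *.tif and transcriptions are <same name>.gt.txt;
# pass another suffix (e.g. ".txt") as the second argument if yours differ.

check_pairs() {
    local dir="$1" ext="${2:-.gt.txt}" bad=0 img txt lines
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue            # glob matched nothing
        txt="${img%.tif}$ext"                # same name, different suffix
        if [ ! -s "$txt" ]; then
            echo "missing or empty text: $txt"
            bad=$((bad + 1))
            continue
        fi
        # Count non-empty lines; each transcription should hold exactly one.
        lines=$(grep -c . "$txt" || true)
        if [ "$lines" -ne 1 ]; then
            echo "expected 1 line of text, found $lines: $txt"
            bad=$((bad + 1))
        fi
    done
    echo "$bad problem(s) found"
    [ "$bad" -eq 0 ]                         # non-zero exit if anything is off
}
```

After sourcing the file, you would call it as `check_pairs ~/mtg/lines` (a hypothetical folder). It prints one line per unpaired or multi-line file and exits non-zero if it finds any, so it can gate a training run.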

