This part is from https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

My question is about training from existing images, not about rendering training data from text.

The overall training process is similar to training 3.04 <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>. Conceptually the same:

1. Prepare training text. <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>
2. Render text to image + box file. (Or create hand-made box files for existing image data.)
3. Make unicharset file.
4. Optionally make dictionary data.
5. Run tesseract to process image + box file to make training data set.
6. Run training on training data set.
7. Combine data files.

Are the above steps similar to the following?

    tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
    unicharset_extractor ara.arial.exp4.box
    echo "arial 0 0 1 0 0" > font_properties  # tell Tesseract about the font
    mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
    shapeclustering -F font_properties -U unicharset ara.arial.exp4.tr
    cntraining ara.arial.exp4.tr
    mv inttemp ara.inttemp
    mv normproto ara.normproto
    mv pffmtable ara.pffmtable
    mv shapetable ara.shapetable
    combine_tessdata ara.

Should I use these steps or not?

The key differences are:

- The boxes only need to be at the *textline level.* It is thus *far easier* to make training data from existing image data.
- The .tr files are replaced by .lstmf data files.
- Fonts *can and should be mixed freely* instead of being separate.
- The clustering steps (mftraining, cntraining, shapeclustering) are replaced with a single slow lstmtraining step.

I don't know much about this last part.

Thanks!
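P.S. As a rough sketch of how the key differences listed above might translate into commands, here is a dry-run script that prints one possible 4.00 sequence for a single image with a line-level box file. The file names, the `--psm 6` page segmentation mode, the iteration count, and the choice to fine-tune from an existing `ara.traineddata` are all assumptions for illustration and do not come from the wiki text quoted above. The flags follow the 4.0 release tools as I understand them and may differ in the 4.00 alpha builds.

```shell
#!/bin/sh
# Dry-run sketch of an LSTM (4.00) training flow for one image + box file.
# Assumed, not from the original post: LANG_CODE, BASE, an existing
# ara.traineddata to fine-tune from, and the iteration count.
LANG_CODE=ara
BASE=ara.arial.exp4
DRY_RUN=${DRY_RUN:-1}   # set DRY_RUN=0 to actually execute the commands

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

lstm_steps() {
    # 1. Image + box file -> .lstmf (takes the place of box.train and .tr).
    run tesseract "$BASE.tif" "$BASE" --psm 6 lstm.train

    # The training list is a text file with one .lstmf path per line.
    if [ "$DRY_RUN" = "0" ]; then
        echo "$BASE.lstmf" > "$LANG_CODE.training_files.txt"
    fi

    # 2. Extract the LSTM model from the existing traineddata (assumed present).
    run combine_tessdata -e "$LANG_CODE.traineddata" "$LANG_CODE.lstm"

    # 3. One lstmtraining step instead of mftraining/cntraining/shapeclustering.
    run lstmtraining \
        --continue_from "$LANG_CODE.lstm" \
        --traineddata "$LANG_CODE.traineddata" \
        --train_listfile "$LANG_CODE.training_files.txt" \
        --model_output "${LANG_CODE}_tuned" \
        --max_iterations 400

    # 4. Convert the last checkpoint into a usable .traineddata file.
    run lstmtraining --stop_training \
        --continue_from "${LANG_CODE}_tuned_checkpoint" \
        --traineddata "$LANG_CODE.traineddata" \
        --model_output "${LANG_CODE}_tuned.traineddata"
}

lstm_steps
```

Run as-is, it only prints the four commands. With `DRY_RUN=0` it would execute them, which requires the Tesseract training tools and the input files to be present. Note that there is no per-font `font_properties` file in this flow, and `.lstmf` files from several fonts can simply be listed together in the training list.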

