I was struggling just like you, until I found this GitHub repository: https://github.com/OCR-D/ocrd-train
It will make your life much easier. All you need to do is provide your images in TIFF format (.tif), each with a ground-truth text file that has the same base name as the image and the extension .gt.txt. The repository takes care of the rest. (You might need to update the Makefile to suit your local machine.)

Whether to train from scratch or fine-tune depends on your language, your data and the problem you are trying to solve. For me, fine-tuning is what I need, because I am happy with the current performance and only need to build on it.

All the useful details you might need can be found in this answer:
<https://groups.google.com/forum/#!searchin/tesseract-ocr/fine$20tuning$20english$20language%7Csort:date/tesseract-ocr/be4-rjvY2tQ/32evtMHlAQAJ>

Thanks to @ShreeShrii for providing support on every matter.

Regards

On Monday, August 6, 2018 at 8:25:07 AM UTC+1, Dimitry Khanukaev wrote:
>
> Hi is there way to do easy training with following concept:
> - I know font of program messages that need recognition
> - I know background
> - Even amount of messages is limited
> Could I? :
> - Just pass to training pairs (the screenshot of the error message + the text on that screenshot).
> - Pass/train to Tesseract number of those pairs
> Get training result and use Tesseract with it expecting that those images (aka screenshots) with texts that I've supplied will be recognized very correctly? (Up to the exact texts that I've been training them with)
>
> In other words is it conceptually wrong way of thinking?
> Sort of I know my images I know exact text on them - can I just tell Tesseract to train against the images to give me the texts that are paired with those images)
> Sort of not getting deep into boxes and stuff :) *Kinda training light?* :)
>
> Thank you for any help.
>
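P.S. Before running the training, it may help to confirm that every image has its matching transcription. What follows is only a rough sketch of such a check, not part of ocrd-train itself: the function name check_pairs is hypothetical, and it assumes all files sit in a single folder, with images ending in lowercase .tif and transcriptions ending in .gt.txt.

```python
from pathlib import Path


def check_pairs(data_dir):
    """Compare .tif images with .gt.txt transcriptions in one folder.

    Hypothetical helper (not part of ocrd-train). Returns three sorted
    lists of base names: (paired, images without text, text without image).
    """
    data_dir = Path(data_dir)
    # Strip the known suffix from each file name to get its base name.
    images = {p.name[: -len(".tif")] for p in data_dir.glob("*.tif")}
    texts = {p.name[: -len(".gt.txt")] for p in data_dir.glob("*.gt.txt")}
    paired = sorted(images & texts)
    missing_text = sorted(images - texts)
    missing_image = sorted(texts - images)
    return paired, missing_text, missing_image
```

Calling check_pairs on your training folder (whichever folder your copy of the Makefile reads from) gives you the list of usable pairs plus the names that still need either an image or a transcription.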

