The docs are pretty bad, so I'm not surprised you didn't find an answer. We also needed to train against an unusual font, so here's our experience. Your situation might be different.

1. The training data needs to be much, much bigger than 100 lines. We took the ".wordlist" file from the language data directory, added our own words to the top, and used that to generate ground truth. It's about 50,000 lines.

2. Each line should be rendered separately, producing three files in the ground-truth directory: an image of the line, a .gt.txt file containing the text in question, and a .box file. That's three files for each of the 50,000 lines, 150,000 files in total.

Unfortunately, *text2image* would not work for our specific font, so we ended up writing our own code to generate the image and box files. It reads the wordlist file line by line, renders an image of the text line, and uses the font info to extract the character boundaries. (Unlike *text2image*, our program figures out the overall bounding rectangle of the text, adds a margin, and creates the image at exactly the right size. *text2image*, at least in my experience, often creates a huge image that is mostly whitespace.)

3. Use the 50,000 sets of ground-truth files to train the model.

Hope that helps.

Danny

On Friday, October 20, 2023 at 11:23:12 PM UTC+8 [email protected] wrote:

> Hello, I simply cannot find the answer to this seemingly simple
> question. I am trying to create a fresh *ground truth* for a highly
> limited set of fonts, for training *tesseract 4.x*.
>
> Using *text2image* I have rendered a large TIF image and the
> corresponding BOX file from a 100-line text file.
>
> My understanding is that this large image is not suitable for training,
> and that I *must* break this down into single-line images and txt files
> to start training. Am I mistaken?
>
> Now I am trying to continue with the tools in the
> *tesseract-ocr/tesstrain* repo (to generate all those small images). But,
> for example, *generate_gt_from_box.py* outputs nothing. Nor can I see how
> any of the *Makefile* targets apply to my goal.
>
> Please help, thanks!
>
> I have searched for days, so I also really wonder *where* I could have
> found the answer to this myself. There are so many READMEs and resources
> all over the place, so I feel like I might be staring at the answer without
> realising it.
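
For illustration, here is a minimal Python sketch of a generator along the lines of steps 1 and 2 in the reply above. It is not the actual program described there. Everything in it is an assumption made for the example: the function names, the `line_NNNNNN` file naming, the font size and margin, and the use of the third-party Pillow library for rendering. Character boxes are approximated from advance widths, so kerning and spaces are not handled specially. The trailing tab row in the box file follows my understanding of how Tesseract 4 box files mark the end of a line; it is worth comparing against a box file that tesstrain generates before relying on it.

```python
from pathlib import Path


def build_training_lines(custom_words, wordlist_lines):
    """Put custom words first, then the stock wordlist; skip blanks and repeats."""
    seen, lines = set(), []
    for word in [*custom_words, *wordlist_lines]:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            lines.append(word)
    return lines


def box_lines(text, char_boxes, image_size):
    """Turn per-character (left, top, right, bottom) pixel boxes into box-file rows.

    Box files measure y from the bottom edge of the image, so top and bottom
    are flipped relative to the usual top-left image origin.
    """
    width, height = image_size
    rows = [
        f"{ch} {left} {height - bottom} {right} {height - top} 0"
        for ch, (left, top, right, bottom) in zip(text, char_boxes)
    ]
    # Assumption: a trailing tab row marks the end of the text line.
    rows.append(f"\t {width} {height} {width + 1} {height + 1} 0")
    return rows


def render_line(text, font_path, font_size=48, margin=10):
    """Render one line into a tightly cropped grayscale image plus character boxes."""
    from PIL import Image, ImageDraw, ImageFont  # third-party: pip install Pillow

    font = ImageFont.truetype(font_path, font_size)
    # Overall bounding rectangle of the text, then add a margin on every side.
    left, top, right, bottom = font.getbbox(text)
    size = (right - left + 2 * margin, bottom - top + 2 * margin)
    image = Image.new("L", size, color=255)
    origin_x, origin_y = margin - left, margin - top
    ImageDraw.Draw(image).text((origin_x, origin_y), text, font=font, fill=0)

    boxes = []
    for i, ch in enumerate(text):
        # Approximation: pen position = advance width of the preceding text.
        pen_x = origin_x + font.getlength(text[:i])
        g_left, g_top, g_right, g_bottom = font.getbbox(ch)
        boxes.append((round(pen_x + g_left), round(origin_y + g_top),
                      round(pen_x + g_right), round(origin_y + g_bottom)))
    return image, boxes


def write_ground_truth(out_dir, index, text, font_path):
    """Write line_NNNNNN.tif, .gt.txt and .box for one line of text."""
    image, boxes = render_line(text, font_path)
    base = Path(out_dir) / f"line_{index:06d}"
    image.save(f"{base}.tif")
    Path(f"{base}.gt.txt").write_text(text + "\n", encoding="utf-8")
    rows = box_lines(text, boxes, image.size)
    Path(f"{base}.box").write_text("\n".join(rows) + "\n", encoding="utf-8")
```

A driver script would read the wordlist, pass it through `build_training_lines` together with the custom words, and call `write_ground_truth` once per line with a running index, which yields the three files per line mentioned in step 2. Only `render_line` and `write_ground_truth` need Pillow; the other two functions use the standard library alone.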

