Hi Danny,

Could you please share your program with the community? This is open-source software, and many people are struggling to get things done. Sharing some experience and a few pieces of code could help a lot of people.
On Saturday, October 21, 2023 at 3:30:06 AM UTC+3 Danny wrote:

> The docs are pretty bad, so I'm not surprised you didn't find an answer.
> We also needed to train against an unusual font, so here's our experience.
> Your situation might be different.
>
> 1. The training data needs to be much, much bigger than 100 lines. We took
> the ".wordlist" file from the language data directory, added our own words
> to the top, and used that to generate ground truth. It's about 50,000 lines.
>
> 2. Each line should be rendered separately into a picture, a gt.txt file
> containing the text in question, and a .box file, all going into the
> ground-truth directory. So that's three files for each of the 50,000 lines,
> 150,000 files in total.
>
> Unfortunately, *text2image* would not work for our specific font, so we
> ended up writing our own code to generate the image and box files. It reads
> the wordlist file line by line, renders an image with the text line, and
> uses the font info to extract the character boundaries. Unlike *text2image*,
> our program figures out the overall bounding rectangle of the text, adds
> a margin, and creates the image at exactly the right size. *text2image*, at
> least in my experience, often creates a huge image with mostly whitespace
> around the text.
>
> 3. Use the 50,000 sets of ground truth files to train the model.
>
> Hope that helps.
> Danny
>
>
> On Friday, October 20, 2023 at 11:23:12 PM UTC+8 [email protected] wrote:
>
>> Hello, I simply cannot find the answer to this seemingly simple
>> question. I am trying to create a fresh *ground truth* for a highly
>> limited set of fonts, for training *tesseract 4.x*.
>>
>> Using *text2image*, I have rendered a large TIF image and the
>> corresponding BOX file from a 100-line text file.
>>
>> My understanding is that this large image is not suitable for training,
>> and that I *must* break it down into single-line images and txt files
>> to start training. Am I mistaken?
>>
>> Now I am trying to continue with the tools in the
>> *tesseract-ocr/tesstrain* repo (to generate all those small images), but,
>> for example, *generate_gt_from_box.py* outputs nothing. Nor can I see
>> how any of the *Makefile* targets apply to my goal.
>>
>>
>> Please help, thanks!
>>
>> _______________________________________________________________________________
>> I have searched for days, so I also really wonder *where* I could have
>> found the answer to this myself. There are so many READMEs and resources
>> all over the place, so I feel like I might be staring at the answer without
>> realising it.
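
For anyone reading this thread later, here is a minimal sketch of the geometry described in step 2 of Danny's message: find the overall bounding rectangle of the text, add a margin, size the image to fit, and emit one box line per character. It is not Danny's program, which was not posted. The `Glyph` type, the `layout_line` and `ground_truth_names` helpers, the `line_00042` naming scheme, and the `.png` extension are all hypothetical and chosen for illustration. The per-character rectangles are assumed to come from whatever font or rendering library you use. The only format detail relied on is that a box file line reads `symbol left bottom right top page`, with the origin at the bottom-left corner of the image. The example uses a single word; how spaces and line ends should be encoded depends on your training pipeline and is not covered here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Glyph:
    """One rendered character and its rectangle in image coordinates.

    Hypothetical helper type: y grows downward, as in most imaging libraries.
    """
    char: str
    left: int
    top: int
    right: int
    bottom: int


def layout_line(glyphs: List[Glyph], margin: int = 10) -> Tuple[int, int, List[str]]:
    """Return (image_width, image_height, box_lines) for a tightly cropped line."""
    if not glyphs:
        raise ValueError("need at least one glyph")

    # Overall bounding rectangle of the text.
    min_left = min(g.left for g in glyphs)
    min_top = min(g.top for g in glyphs)
    max_right = max(g.right for g in glyphs)
    max_bottom = max(g.bottom for g in glyphs)

    # Image is exactly the text rectangle plus a margin on every side.
    width = (max_right - min_left) + 2 * margin
    height = (max_bottom - min_top) + 2 * margin

    # Shift that moves the text rectangle to start at (margin, margin).
    dx = margin - min_left
    dy = margin - min_top

    box_lines = []
    for g in glyphs:
        left = g.left + dx
        right = g.right + dx
        # Box files measure y from the bottom edge, so flip the axis.
        bottom = height - (g.bottom + dy)
        top = height - (g.top + dy)
        box_lines.append(f"{g.char} {left} {bottom} {right} {top} 0")
    return width, height, box_lines


def ground_truth_names(index: int, image_ext: str = "png") -> Tuple[str, str, str]:
    """Return the three file names for one line (assumed naming scheme)."""
    stem = f"line_{index:05d}"
    return f"{stem}.{image_ext}", f"{stem}.gt.txt", f"{stem}.box"


# Example: the word "Hi" with made-up glyph rectangles.
example = [Glyph("H", 5, 3, 15, 23), Glyph("i", 17, 1, 21, 23)]
width, height, boxes = layout_line(example, margin=4)
print(width, height)   # 24 30
print("\n".join(boxes))
# H 4 4 14 24 0
# i 16 4 20 26 0
```

With these numbers in hand, you would create an image of `width` x `height` pixels, draw the text offset by `(dx, dy)`, and write the image, the text line, and the box lines to the three files named by `ground_truth_names`. Repeating that for every line of the word list gives the three-files-per-line layout described above.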

