[tesseract-ocr] Re: Render Ground Truth from Scratch for Training

'Danny' via tesseract-ocr Fri, 20 Oct 2023 17:30:11 -0700

The docs are pretty bad, so I'm not surprised you didn't find an answer.
We also needed to train against an unusual font, so here's our experience.
Your situation might be different.

1. The training data needs to be much, much bigger than 100 lines. We took
the ".wordlist" file from the language data directory, added our own words
to the top, and used that to generate ground truth. It's about 50,000 lines.

2. Each line should be rendered separately into the ground-truth directory
as three files: a picture, a gt.txt file containing the text in question,
and a .box file. So that's three files for each of the 50,000 lines,
150,000 files in total.

Unfortunately, *text2image* would not work for our specific font, so we
ended up writing our own code to generate the image and box files. It reads
the wordlist file line by line, renders an image with the text line, and
uses the font info to extract the character boundaries. Unlike *text2image*,
our program figures out the overall bounding rectangle of the text, adds a
margin, and creates the image at exactly the right size. *text2image*, at
least in my experience, often creates a huge image with mostly whitespace
around the text.

3. Use the 50,000 sets of ground truth files to train the model.
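
A minimal sketch of steps 1 and 2, assuming Python with Pillow. This is an
illustration, not the program described above: the function names, the PNG
output, the font size, and the margin are all assumptions. Box lines use the
usual "symbol left bottom right top page" layout with the origin at the
bottom-left of the image; the tab-marked end-of-line entry is an assumed
convention, so compare it with a box file produced by your own Tesseract
setup before training.

```python
from pathlib import Path


def char_boxes(text, measure, x_origin, bottom, top):
    """Return one (char, left, bottom, right, top) tuple per character.

    `measure(s)` gives the advance width of string `s` in pixels, so the
    difference between two prefix widths locates each character.
    Coordinates follow the box-file convention: origin at the bottom-left.
    Every box spans the full text height, which is coarse but simple.
    """
    boxes = []
    for i, ch in enumerate(text):
        left = x_origin + (round(measure(text[:i])) if i else 0)
        right = x_origin + round(measure(text[: i + 1]))
        boxes.append((ch, left, bottom, right, top))
    return boxes


def format_box(boxes, width, height):
    """Serialise boxes as 'symbol left bottom right top page' lines."""
    lines = [f"{ch} {l} {b} {r} {t} 0" for ch, l, b, r, t in boxes]
    # Assumed end-of-line marker: a tab as the symbol, placed just past
    # the image. Verify against a box file from your own setup.
    lines.append(f"\t {width} {height} {width + 1} {height + 1} 0")
    return "\n".join(lines) + "\n"


def render_line(text, font_path, stem, font_size=40, margin=10):
    """Write <stem>.png, <stem>.gt.txt and <stem>.box for one text line."""
    from PIL import Image, ImageDraw, ImageFont  # Pillow, needed only here

    font = ImageFont.truetype(str(font_path), font_size)
    # Bounding rectangle of the rendered text, relative to the draw origin.
    left, top, right, bottom = font.getbbox(text)
    width = (right - left) + 2 * margin
    height = (bottom - top) + 2 * margin

    # Image sized exactly to the text plus the margin, black on white.
    image = Image.new("L", (width, height), 255)
    ImageDraw.Draw(image).text(
        (margin - left, margin - top), text, font=font, fill=0
    )
    image.save(f"{stem}.png")

    Path(f"{stem}.gt.txt").write_text(text + "\n", encoding="utf-8")
    boxes = char_boxes(
        text, font.getlength, margin - left, margin, height - margin
    )
    Path(f"{stem}.box").write_text(
        format_box(boxes, width, height), encoding="utf-8"
    )


def render_wordlist(wordlist_path, font_path, out_dir):
    """Render every non-empty line of the wordlist into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(wordlist_path, encoding="utf-8") as handle:
        for n, line in enumerate(handle):
            text = line.strip()
            if text:
                render_line(text, font_path, out_dir / f"line_{n:06d}")
```

Calling render_wordlist("my.wordlist", "MyFont.ttf", "data/myfont-ground-truth")
(all three names are placeholders) would write one image, one gt.txt, and one
box file per wordlist line into that directory.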

Hope that helps.
Danny

On Friday, October 20, 2023 at 11:23:12 PM UTC+8 [email protected] wrote:

> Hello, I simply cannot find the answer to this seemingly simple
> question. I am trying to create a fresh *ground truth* for a highly
> limited set of fonts, for training *tesseract 4.x*.
>
> Using *text2image*, I have rendered a large TIF image and the
> corresponding BOX file from a 100-line text file.
>
> My understanding is that this large image is not suitable for training, 
> and that I *must* break this down into single line images and txt files, 
> to start training. Am I mistaken?
>
> Now I am trying to continue with the tools in the
> *tesseract-ocr/tesstrain* repo (to generate all those small images). But,
> for example, *generate_gt_from_box.py* outputs nothing. Nor can I see how
> any of the *Makefile* targets apply to my goal.
>
>
> Please help, thanks!
>
> _______________________________________________________________________________
> I have searched for days, so I also really wonder *where* I could have 
> found the answer to this myself. There are so many READMEs and resources  
> all over the place, so I feel like I might be staring at the answer without 
> realising it.
>
>

