[tesseract-ocr] Re: Render Ground Truth from Scratch for Training

Des Bw Sat, 21 Oct 2023 01:53:57 -0700

Hi Danny,
Could you please share your program with the community? This is open source
software, and many people are struggling to get things done. Sharing some
experience and pieces of code could help a lot of people.


On Saturday, October 21, 2023 at 3:30:06 AM UTC+3 Danny wrote:

> The docs are pretty bad, so I'm not surprised you didn't find an answer.
> We also needed to train against an unusual font, so here's our experience.
> Your situation might be different.
>
> 1. The training data needs to be much bigger than 100 lines. We took
> the ".wordlist" file from the language data directory, added our own words
> to the top, and used that to generate ground truth. It's about 50,000 lines.
>
> 2. Each line should be rendered separately into the ground-truth directory
> as three files: an image, a gt.txt file containing the line's text, and a
> .box file. For 50,000 lines, that's 150,000 files in total.
>
> Unfortunately, *text2image* would not work for our specific font, so we
> ended up writing our own code to generate the image and box files. It reads
> the wordlist file line by line, renders an image of each text line, and
> uses the font info to extract the character boundaries. Unlike *text2image*,
> our program figures out the overall bounding rectangle of the text, adds
> a margin, and creates the image at exactly the right size. *text2image*, at
> least in my experience, often creates a huge image with mostly whitespace
> around the text.
>
> 3. Use the 50,000 sets of ground truth files to train the model.
>
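For anyone following along, points 1 and 2 above could be sketched roughly
as below. This is not Danny's program, just a minimal illustration of
building the line list and writing one gt.txt per line. The function names,
the "line_NNNNNN" file naming, and the de-duplication step are my own
assumptions, not something stated in this thread.

```python
from pathlib import Path


def build_training_lines(wordlist_path, extra_words):
    """Return extra_words followed by the wordlist entries.

    Blank lines and repeated entries are dropped (an assumption of this
    sketch; the thread does not say whether duplicates were removed).
    """
    candidates = list(extra_words)
    candidates += Path(wordlist_path).read_text(encoding="utf-8").splitlines()

    seen = set()
    lines = []
    for candidate in candidates:
        text = candidate.strip()
        if text and text not in seen:
            seen.add(text)
            lines.append(text)
    return lines


def write_gt_files(lines, out_dir, prefix="line"):
    """Write one <prefix>_NNNNNN.gt.txt file per line and return the paths.

    The image and .box file for each line would share the same base name.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = []
    for index, text in enumerate(lines):
        path = out / f"{prefix}_{index:06d}.gt.txt"
        path.write_text(text + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Calling build_training_lines("eng.wordlist", ["myword"]) and passing the
result to write_gt_files(...) would give one text file per line; the file
name "eng.wordlist" is only an example.
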
> Hope that helps.
> Danny
>
>
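The "tight bounding rectangle plus margin" idea from Danny's message can
also be sketched without any rendering code. The version below is only my
guess at the geometry, not the actual program. It assumes you already have
one pixel box per character as (left, top, right, bottom) with the origin at
the top left, for example from your font library, and it emits box-file
lines in the "symbol left bottom right top page" layout, where y is measured
from the bottom of the image. The function names and the choice to skip
whitespace characters are assumptions of this sketch.

```python
def tight_canvas(boxes, margin):
    """Return (width, height, dx, dy) for a canvas that fits boxes plus margin.

    boxes holds (left, top, right, bottom) tuples with a top-left origin.
    dx and dy are the shifts that move the text so it starts at the margin.
    """
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    right = max(b[2] for b in boxes)
    bottom = max(b[3] for b in boxes)
    width = (right - left) + 2 * margin
    height = (bottom - top) + 2 * margin
    return width, height, margin - left, margin - top


def make_box_lines(glyphs, margin=10, page=0):
    """Return ((width, height), box_lines) for a list of (char, box) pairs.

    Whitespace characters are skipped (an assumption; check what your
    training pipeline expects for spaces).
    """
    visible = [(ch, box) for ch, box in glyphs if not ch.isspace()]
    if not visible:
        raise ValueError("no visible characters to place")

    width, height, dx, dy = tight_canvas([box for _, box in visible], margin)

    box_lines = []
    for ch, (left, top, right, bottom) in visible:
        # Shift into the cropped canvas, then flip y so that 0 is the
        # bottom edge of the image, as the box format expects.
        x_left = left + dx
        x_right = right + dx
        y_bottom = height - (bottom + dy)
        y_top = height - (top + dy)
        box_lines.append(f"{ch} {x_left} {y_bottom} {x_right} {y_top} {page}")
    return (width, height), box_lines
```

The returned size is what the image would be created with, and the same dx
and dy shift would be applied when drawing the text, so the picture and the
box file stay in agreement.
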
> On Friday, October 20, 2023 at 11:23:12 PM UTC+8 [email protected] wrote:
>
>> Hello, I simply cannot find the answer to this seemingly simple
>> question. I am trying to create a fresh *ground truth* for a highly
>> limited set of fonts, for training *tesseract 4.x*.
>>
>> Using *text2image*, I have rendered a large TIF image and the
>> corresponding BOX file from a 100-line text file.
>>
>> My understanding is that this large image is not suitable for training, 
>> and that I *must* break this down into single line images and txt files, 
>> to start training. Am I mistaken?
>>
>> Now I am trying to continue with the tools in the
>> *tesseract-ocr/tesstrain* repo (to generate all those small images), but
>> for example *generate_gt_from_box.py* outputs nothing. Nor can I see
>> how any of the *Makefile* targets apply to my goal.
>>
>>
>> Please help, thanks!
>>
>> _______________________________________________________________________________
>> I have searched for days, so I also really wonder *where* I could have 
>> found the answer to this myself. There are so many READMEs and resources  
>> all over the place, so I feel like I might be staring at the answer without 
>> realising it.
>>
>>

