[tesseract-ocr] Lessons, best practices, recommendations, strategies, hacks

Des Bw Sat, 21 Oct 2023 05:28:07 -0700

There is no exhaustive user manual for training tesseract. We all start in 
the dark and pick up bits of information in different places to 
learn the ins and outs of tesseract.


It would be great if we could collectively write a better manual. Until 
then, we can collect here the observations, best practices, hacks and 
lessons we have accumulated in our adventures with tesseract.

I will start with some of my observations. I gathered them by reading 
between the lines and from my own failed experiments: 
1. Training from scratch is very difficult because tesseract requires an 
extensive data set. It looks like it requires over 300,000 text lines 
(a text file of around 26 MB).
https://github.com/tesseract-ocr/tesseract/issues/3909

 Multiply that by the number of fonts you want to train, and the data grows 
in proportion. That requires very powerful computers running for weeks or 
months. 
So, for regular users, training from an existing network layer or fine-tuning 
are the most plausible options. 

2. Best practice: do not make your text lines too long. The recommended number 
of words per line is 10-12, again according to the link above. 
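
As a rough sketch of the two observations above: the Python snippet below 
re-wraps a training text into lines of at most 12 words and estimates how many 
line images you end up with once every line is rendered in every font. The 
helper names (rewrap, rendered_line_count) are my own and hypothetical, they 
are not part of tesseract or tesstrain. The assumption that each text line is 
rendered once per font, and the 4-font figure, are only for illustration.

```python
# Hypothetical helpers for preparing training text; not a tesseract API.

def rewrap(text, max_words=12):
    """Split text into lines holding at most max_words words each."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def rendered_line_count(num_lines, num_fonts):
    """Assumption: every text line is rendered once per font."""
    return num_lines * num_fonts


if __name__ == "__main__":
    # 30 dummy words become lines of 12, 12 and 6 words.
    sample = " ".join(f"w{i}" for i in range(30))
    print([len(line.split()) for line in rewrap(sample)])

    # 300,000 text lines in 4 fonts (example figure).
    print(rendered_line_count(300_000, 4))
```

The word limit is a parameter, so you can try 10 instead of 12 and see which 
works better for your data.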

(...to be continued)

