[tesseract-ocr] Handling noise

Paul Sat, 02 Aug 2014 13:14:28 -0700

Hi all,

I have several scanned documents that have a lot of noise between lines and 
between words. Tesseract fails to ignore the noise: it either merges the 
specks into the next character or treats them as a separate character, often 
a dot or comma. I attached an image that shows some of that noise.


I am using the latest SVN version of Tesseract 3.03. Tesseract 3.02 does 
slightly better at ignoring the noise.

Now my questions are:

   1. What are the configuration parameters (maybe also hard coded 
   constants) inside Tesseract that affect the noise vs. good blob 
   classification?
   2. Is there a way to define a minimum number of pixels or dimensions for 
   a connected component?
   3. Is there a way to limit the scaling of a blob, so that it won't get 
   matched to a character prototype?

I already found the configuration parameters heavy_noise_reduction and 
textord_noise_hfract. heavy_noise_reduction gives me bad results; by 
tweaking textord_noise_hfract I can get better results, but they still 
aren't satisfying. Maybe there's a better alternative in the code that I 
can't find.
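
To make question 2 concrete, here is a minimal sketch of the kind of filter I 
have in mind, done as a pre-processing step outside Tesseract. It assumes the 
page is already binarized into rows of 0/1 values (1 = ink). The function name 
remove_small_components and the thresholds min_pixels, min_width and 
min_height are hypothetical names for illustration only; they are not 
Tesseract parameters.

```python
from collections import deque


def remove_small_components(image, min_pixels=4, min_width=1, min_height=1):
    """Return a copy of a binary image (rows of 0/1) with small blobs erased.

    Hypothetical helper: a blob is an 8-connected group of 1-pixels. It is
    erased if it has fewer than min_pixels pixels, or if its bounding box is
    narrower than min_width or shorter than min_height.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue

            # Breadth-first flood fill collects one connected component.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))

            # Bounding-box dimensions of the component.
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            blob_height = max(ys) - min(ys) + 1
            blob_width = max(xs) - min(xs) + 1

            too_small = (len(pixels) < min_pixels
                         or blob_width < min_width
                         or blob_height < min_height)
            if too_small:
                for py, px in pixels:
                    result[py][px] = 0
    return result


# Tiny example: a 3x3 blob plus one isolated speck in the last column.
page = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
]
cleaned = remove_small_components(page, min_pixels=4)
```

The obvious drawback is that a fixed size threshold also removes real 
periods, commas and the dots on i and j, so any threshold would have to be 
chosen relative to the text size.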

Regards,
Paul
