[tesseract-ocr] Force Tesseract to do individual character OCR only

Dave Wood Mon, 28 Oct 2019 21:13:25 -0700

I am trying to use Tesseract to OCR screenshots from various Windows 
applications.  The data is essentially a random collection of letters 
and numbers, not the written words and sentences that Tesseract is 
primarily designed to handle.


Here is my setup:

-Tesseract Windows Version 5.0.0 from UB-Mannheim
-image cleaning and resizing using OpenCV (I have put a lot of effort into 
getting this as good as I can)
-parameters --psm 6 --oem 1 (I have also tried oem 0 and 3 with pretty much 
the same results)
-config file contents
     language_model_penalty_non_dict_word 0.0
     language_model_penalty_chartype 0.0
     language_model_penalty_case 0.0
     language_model_penalty_non_freq_dict_word 0.0
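
For anyone who wants to reproduce this setup, here is a minimal Python 
sketch that writes the config file and assembles the command line.  The 
names write_config, build_command, shot.png and screens.cfg are made up 
for illustration; only the config variables and the --psm/--oem values 
come from the setup above.

```python
from pathlib import Path

# Config variables copied from the setup above; each penalty is set to 0.0.
CONFIG_VARS = {
    "language_model_penalty_non_dict_word": 0.0,
    "language_model_penalty_chartype": 0.0,
    "language_model_penalty_case": 0.0,
    "language_model_penalty_non_freq_dict_word": 0.0,
}


def write_config(path):
    """Write one 'name value' pair per line and return the text written."""
    text = "".join(f"{name} {value}\n" for name, value in CONFIG_VARS.items())
    Path(path).write_text(text)
    return text


def build_command(image, config, psm=6, oem=1):
    """Assemble the tesseract call; 'stdout' as output base prints the text."""
    return ["tesseract", str(image), "stdout",
            "--psm", str(psm), "--oem", str(oem), str(config)]


if __name__ == "__main__":
    write_config("screens.cfg")  # hypothetical file name
    print(" ".join(build_command("shot.png", "screens.cfg")))
```

The resulting list can be handed to subprocess.run(cmd, capture_output=True, 
text=True) to collect the recognized text, assuming tesseract is on the PATH.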

Tesseract is performing reasonably well for my needs, but I have a few 
problems that I can't resolve.  They seem to stem from Tesseract deciding 
what a given character is based not only on its pixel layout, but also on 
the context in which the character occurs.

*Issue #1*

Occasionally Tesseract inserts extra characters in its output, seemingly 
when it is unsure how to choose between a couple of different alternatives:

[image: OneOfThree.png]
For the above image, Tesseract produces the following output:

10of3

As you can see, Tesseract inserts the digit 0 in front of the lowercase 
letter o in the output.  It also ignores the white space in the image.

Others have reported this issue, for example in the GitHub issue below:

https://github.com/tesseract-ocr/tesseract/issues/1465

*Issue #2*

As shown in the above example, Tesseract sometimes ignores white space 
that, at least to my eye, is too large to miss.

*Issue #3*

Tesseract has a hard time dealing with random strings of letters and 
digits mixed together in no particular order.  It tends to output a digit 
when the previous character was a digit, and a letter when the previous 
character was a letter.

Others have reported this issue, for example in the GitHub issue below:

https://github.com/tesseract-ocr/tesseract/issues/733


*Suggestion:*

At least for my situation, it seems that the best thing would be a 
definitive Tesseract option to interpret individual characters without 
reference to their context.  Since my data comes from screenshots, it is 
very clear and very consistent, and I would think that a 
character-by-character mode would work well.
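
One possible workaround, offered only as a sketch: split each text line 
into glyphs before OCR and recognize every glyph on its own, for instance 
with --psm 10 (treat the image as a single character).  I have not 
verified that this removes the context effects described above.  The code 
below covers just the splitting step, assuming a binarized single-line 
image given as rows of 0/1 values (1 = ink).  The function name 
split_glyphs and the space_gap threshold are hypothetical, and glyphs that 
touch each other or contain internal gaps would need extra handling.

```python
def split_glyphs(rows, space_gap=4):
    """Split a binarized text line into glyph column spans.

    rows: list of equal-length lists of 0/1, where 1 marks an ink pixel.
    space_gap: hypothetical threshold; a blank run at least this many
               columns wide between two glyphs is reported as a space.
    Returns a list of (start, end) column spans (end exclusive), with
    None inserted wherever a space was detected.
    """
    width = len(rows[0]) if rows else 0
    # A column contains ink if any row has a 1 in that column.
    ink = [any(row[x] for row in rows) for x in range(width)]

    items = []
    start = None      # first column of the glyph being scanned
    last_end = None   # end column of the previous glyph
    # Scan one column past the edge so a glyph touching the edge is closed.
    for x in range(width + 1):
        has_ink = x < width and ink[x]
        if has_ink and start is None:
            if last_end is not None and x - last_end >= space_gap:
                items.append(None)  # wide blank run: treat as a space
            start = x
        elif not has_ink and start is not None:
            items.append((start, x))
            last_end = x
            start = None
    return items


if __name__ == "__main__":
    line = [[0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0]]
    print(split_glyphs(line))
```

Each (start, end) span could then be cropped out of the original image 
and passed to Tesseract separately, with a space written to the output 
wherever None appears.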

