I am trying to use Tesseract to OCR screenshots from various Windows
applications. The data is essentially a random collection of letters
and digits, not the written words and sentences that Tesseract is primarily
designed to handle.
Here is my setup:
- Tesseract 5.0.0 for Windows, from UB-Mannheim
- Image cleaning and resizing with OpenCV (I have put a lot of effort into
  getting this as good as I can)
- Parameters: --psm 6 --oem 1 (I have also tried --oem 0 and --oem 3, with
  much the same results)
- Config file contents:
    language_model_penalty_non_dict_word 0.0
    language_model_penalty_chartype 0.0
    language_model_penalty_case 0.0
    language_model_penalty_non_freq_dict_word 0.0
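
For reference, the setup above corresponds roughly to the invocation sketched
below. This is only an illustration: the file name screen.cfg and the helper
names are placeholders, and it assumes the tesseract executable is on the PATH.
The command-line shape used is image, output base, options, then config file,
with "stdout" as the output base so the text is printed instead of written to
a file.

```python
import subprocess
from pathlib import Path

# The four penalty settings from the config file listed above.
CONFIG_LINES = [
    "language_model_penalty_non_dict_word 0.0",
    "language_model_penalty_chartype 0.0",
    "language_model_penalty_case 0.0",
    "language_model_penalty_non_freq_dict_word 0.0",
]


def write_config(path):
    """Write the penalty settings to a Tesseract config file."""
    Path(path).write_text("\n".join(CONFIG_LINES) + "\n", encoding="ascii")
    return str(path)


def build_command(image_path, config_path, psm=6, oem=1, exe="tesseract"):
    """Build the argument list for one Tesseract run.

    Using "stdout" as the output base makes Tesseract print the text.
    """
    return [exe, str(image_path), "stdout",
            "--psm", str(psm), "--oem", str(oem), str(config_path)]


def run_ocr(image_path, config_path):
    """Run Tesseract and return the recognized text."""
    result = subprocess.run(build_command(image_path, config_path),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

With these helpers, run_ocr("OneOfThree.png", write_config("screen.cfg")) would
reproduce the run described in Issue #1 below.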
Tesseract performs reasonably well for my needs, but I have a few problems
that I can't resolve. They seem to be related to the way Tesseract decides
what a given character is: not only from its pixel layout, but also from the
context in which the character occurs.
*Issue #1*
Occasionally Tesseract inserts extra characters in its output, seemingly
when it is unsure how to choose between a couple of different alternatives:
[image: OneOfThree.png]
For the above image, Tesseract produces the following output:
10of3
As you can see, Tesseract inserts the digit 0 in front of the lowercase
letter o in the output. It also ignores the whitespace in the image.
Others have reported this issue, for example in the thread below:
https://github.com/tesseract-ocr/tesseract/issues/1465
*Issue #2*
As shown in the example under Issue #1, Tesseract sometimes ignores
whitespace that, at least to my eye, is too wide to be missed.
*Issue #3*
Tesseract has a hard time with random strings of letters and digits mixed
together in no particular order. It tends to output a digit when the
previous character was a digit, and a letter when the previous character
was a letter.
Others have reported this issue, for example in the thread below:
https://github.com/tesseract-ocr/tesseract/issues/733
*Suggestion:*
At least for my situation, the best solution would be a definitive
Tesseract option to interpret individual characters without reference to
their context. Since my data comes from screenshots, the text is very clear
and very consistent, so I would expect a character-by-character mode to
work well.
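
Until such an option exists, one possible workaround is to split the line into
single glyphs myself and recognize each one separately, for example with
--psm 10 (treat the image as a single character). The sketch below shows only
the splitting step, using a plain column projection. It assumes the image has
already been thresholded to 1 for ink and 0 for background, that neighbouring
glyphs never touch, and that space_gap (a value I would have to tune per font)
separates ordinary letter spacing from a real space. The function names are my
own, not part of Tesseract.

```python
def column_runs(binary):
    """Return (start, end) column ranges that contain ink, end exclusive.

    `binary` is a list of rows; each pixel is 1 for ink, 0 for background.
    """
    width = len(binary[0]) if binary else 0
    has_ink = [any(row[x] for row in binary) for x in range(width)]
    runs, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                  # a glyph begins at this column
        elif not ink and start is not None:
            runs.append((start, x))    # the glyph ended at the previous column
            start = None
    if start is not None:
        runs.append((start, width))    # glyph touches the right edge
    return runs


def split_glyphs(binary, space_gap):
    """Return a left-to-right list of ("glyph", (start, end)) and
    ("space", None) items.

    A run of at least `space_gap` blank columns between two glyphs is
    reported as a space.
    """
    items, prev_end = [], None
    for start, end in column_runs(binary):
        if prev_end is not None and start - prev_end >= space_gap:
            items.append(("space", None))
        items.append(("glyph", (start, end)))
        prev_end = end
    return items
```

Each ("glyph", (start, end)) range would then be cropped from the image and
passed to Tesseract on its own, and each ("space", None) item would become a
blank in the output, which would also sidestep the dropped whitespace in
Issue #2.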