Re: [tesseract-ocr] Force Tesseract to do individual character OCR only

Shree Devi Kumar Mon, 28 Oct 2019 21:18:27 -0700

Have you tried to OCR it character by character, using an appropriate psm?
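
One way to try this is sketched below in Python: locate each glyph from a per-column count of dark pixels, crop it, and run Tesseract on every crop with --psm 10 (treat the image as a single character). This is a sketch under assumptions, not a tested recipe. The helper names (find_glyph_spans, ocr_single_char, assemble) and the space_gap threshold are made up for illustration, the input is assumed to be a single binarized text line, and the cropping itself is left to whatever image library is already in use.

```python
import subprocess
from typing import List, Tuple

# (start column, end column exclusive, whether a space precedes the glyph)
Span = Tuple[int, int, bool]


def find_glyph_spans(ink: List[int], space_gap: int) -> List[Span]:
    """Split a text line into glyphs using a vertical projection.

    ink[x] is the number of dark pixels in column x of a binarized line.
    A run of non-empty columns is treated as one glyph. A gap of at least
    space_gap empty columns before a glyph is reported as a space.
    """
    spans: List[Span] = []
    start = None      # first column of the glyph currently being scanned
    last_end = None   # end column of the previous glyph
    for x, count in enumerate(ink + [0]):  # trailing 0 closes the last run
        if count > 0 and start is None:
            start = x
        elif count == 0 and start is not None:
            space_before = last_end is not None and start - last_end >= space_gap
            spans.append((start, x, space_before))
            last_end = x
            start = None
    return spans


def ocr_single_char(image_path: str) -> str:
    """Run the tesseract CLI on one cropped glyph image with --psm 10."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "10", "--oem", "1"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def assemble(chars: List[str], spans: List[Span]) -> str:
    """Join per-glyph results, inserting spaces where the gaps were wide."""
    out = []
    for ch, (_, _, space_before) in zip(chars, spans):
        if space_before:
            out.append(" ")
        out.append(ch)
    return "".join(out)
```

Because spaces are decided from measured gap widths, white space no longer depends on the OCR engine. The column projection will merge glyphs that touch and split glyphs made of horizontally separate parts (a double quote, for example), so it only suits clean, evenly rendered text.
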

On Tue, Oct 29, 2019, 09:42 Dave Wood <[email protected]> wrote:


> I am trying to use Tesseract to OCR screenshots from various Windows
> applications.  The data is essentially a random collection of letters
> and numbers, not the written words and sentences that Tesseract is
> primarily designed to handle.
>
> Here is my setup:
>
> -Tesseract Windows Version 5.0.0 from UB-Mannheim
> -image cleaning and resizing using openCV (have put much effort into
> getting this as good as I can)
> -parameters --psm 6 --oem 1 (have also tried --oem 0 and --oem 3 with
> pretty much the same results)
> -config file contents
>      language_model_penalty_non_dict_word 0.0
>      language_model_penalty_chartype 0.0
>      language_model_penalty_case 0.0
>      language_model_penalty_non_freq_dict_word 0.0
>
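
For reference, the setup quoted above could be reproduced along these lines. This is a minimal sketch, assuming the tesseract executable is on PATH. The config file name (nodict.cfg), the image name, and the helper names are hypothetical. The four parameter names and the --psm/--oem values are taken from the message. The sketch only builds the config text and the argument list and does not run Tesseract.

```python
from typing import Dict, List

# Parameter names and values copied from the config file in the message.
PENALTY_SETTINGS: Dict[str, str] = {
    "language_model_penalty_non_dict_word": "0.0",
    "language_model_penalty_chartype": "0.0",
    "language_model_penalty_case": "0.0",
    "language_model_penalty_non_freq_dict_word": "0.0",
}


def render_config(settings: Dict[str, str]) -> str:
    """Render settings as a Tesseract config file: one 'name value' per line."""
    return "".join(f"{name} {value}\n" for name, value in settings.items())


def build_command(image: str, config_path: str, psm: int = 6, oem: int = 1) -> List[str]:
    """Build the tesseract command line; 'stdout' prints the text to stdout.

    Config file names go last, after the options.
    """
    return ["tesseract", image, "stdout",
            "--psm", str(psm), "--oem", str(oem), config_path]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(render_config(PENALTY_SETTINGS), end="")
    print(" ".join(build_command("screenshot.png", "nodict.cfg")))
```

Building the argument list as a Python list, rather than a single shell string, avoids quoting problems with Windows paths that contain spaces.
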
> Tesseract is performing reasonably well for my needs, but I have a couple
> of problems that I can't resolve.  They seem to be related to the way
> Tesseract decides what a given character is: not just from its pixel
> layout, but also from the context in which the character occurs.
>
> *Issue #1*
>
> Occasionally Tesseract inserts extra characters in its output, seemingly
> when it is unsure how to choose between a couple of different alternatives:
>
> [image: OneOfThree.png]
> For the above image, Tesseract produces the following output:
>
> 10of3
>
> As you can see, Tesseract inserts the digit 0 in front of the lower case
> letter o in the output.  It also ignores the white space in the image.
>
> Others have reported this issue, for example the thread below:
>
> https://github.com/tesseract-ocr/tesseract/issues/1465
>
> *Issue #2*
>
> As shown in the above example, Tesseract sometimes ignores white space
> which at least to my eye is big enough not to be missed.
>
> *Issue #3*
>
> Tesseract has a hard time dealing with random strings of alpha characters
> and digits mixed together in no particular order.  It has a tendency to
> output a digit when the previous character was a digit, and an alpha when
> the previous character was an alpha.
>
> Others have reported this issue, for example the thread below:
>
> https://github.com/tesseract-ocr/tesseract/issues/733
>
>
> *Suggestion:*
>
> At least for my situation, it seems that the best thing would be if there
> were a definitive Tesseract option to interpret individual characters
> without reference to their context.  Since my data comes from screen shots,
> it is very clear and very consistent, and I would think that a
> character-by-character mode would work well.
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/a434e930-a53e-44e0-bfd7-a46385ea091a%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/a434e930-a53e-44e0-bfd7-a46385ea091a%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
