Have you tried to OCR it character by character, using an appropriate page segmentation mode (for example --psm 10, which treats the image as a single character)?

On Tue, Oct 29, 2019, 09:42 Dave Wood <[email protected]> wrote:
> I am trying to use Tesseract to OCR screen shots from various Windows
> applications. So essentially the data is a random collection of letters
> and numbers, not written words/sentences like it was primarily oriented to
> handle.
>
> Here is my setup:
>
> - Tesseract Windows Version 5.0.0 from UB-Mannheim
> - image cleaning and resizing using openCV (have put much effort into
>   getting this as good as I can)
> - parameters --psm 6 --oem 1 (have also tried oem 0 and 3 with pretty much
>   same results)
> - config file contents
>     language_model_penalty_non_dict_word 0.0
>     language_model_penalty_chartype 0.0
>     language_model_penalty_case 0.0
>     language_model_penalty_non_freq_dict_word 0.0
>
> Tesseract is performing reasonably well for my needs, but I have a couple
> of problems that I can't resolve. They seem to be related to Tesseract
> functionality which tries to decide what a given character is not just
> based on its pixel layout, but also based on the context that the character
> occurs in.
>
> *Issue #1*
>
> Occasionally Tesseract inserts extra characters in its output, seemingly
> when it is unsure how to choose between a couple of different alternatives:
>
> [image: OneOfThree.png]
> For the above image, Tesseract produces the following output:
>
> 10of3
>
> As you can see, Tesseract inserts the digit 0 in front of the lower case
> letter o in the output. It also ignores the white space in the image.
>
> Others have reported this issue, for example the thread below:
>
> https://github.com/tesseract-ocr/tesseract/issues/1465
>
> *Issue #2*
>
> As shown in the above example, Tesseract sometimes ignores white space
> which at least to my eye is big enough not to be missed.
>
> *Issue #3*
>
> Tesseract has a hard time dealing with random strings of alpha characters
> and digits mixed together in no particular order. It has a tendency to
> output a digit when the previous character was a digit, and an alpha when
> the previous character was an alpha.
>
> Others have reported this issue, for example the thread below:
>
> https://github.com/tesseract-ocr/tesseract/issues/733
>
>
> *Suggestion:*
>
> At least for my situation, it seems that the best thing would be if there
> were a definitive Tesseract option to interpret individual characters
> without reference to their context. Since my data comes from screen shots,
> it is very clear and very consistent, and I would think that a
> character-by-character mode would work well.
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/a434e930-a53e-44e0-bfd7-a46385ea091a%40googlegroups.com
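
The character-by-character idea from the reply at the top could be sketched roughly as follows. This is only an illustration under stated assumptions, not something taken from the thread: it assumes the screenshot is already binarized (1 = ink, 0 = background), that glyphs do not touch each other, and that a gap of at least `space_gap` blank columns means a space. The names `column_spans`, `mark_spaces`, `single_char_command` and `space_gap` are hypothetical. The only Tesseract-specific part is the command line, which uses `--psm 10` (single character) and the `--oem 1` setting mentioned in the original message. Cropping each span to its own image file and running the command is left out.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # [start, end) column range of one glyph


def column_spans(bitmap: List[List[int]]) -> List[Span]:
    """Find runs of columns that contain at least one ink pixel."""
    if not bitmap:
        return []
    width = len(bitmap[0])
    has_ink = [any(row[x] for row in bitmap) for x in range(width)]
    spans: List[Span] = []
    start: Optional[int] = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                     # a glyph begins here
        elif not ink and start is not None:
            spans.append((start, x))      # the glyph ended at the previous column
            start = None
    if start is not None:
        spans.append((start, width))      # glyph runs to the right edge
    return spans


def mark_spaces(spans: List[Span], space_gap: int) -> List[Optional[Span]]:
    """Insert None between glyphs separated by at least space_gap blank columns."""
    out: List[Optional[Span]] = []
    for i, span in enumerate(spans):
        if i > 0 and span[0] - spans[i - 1][1] >= space_gap:
            out.append(None)              # treat the wide gap as a space
        out.append(span)
    return out


def single_char_command(image_path: str) -> List[str]:
    """Build a Tesseract CLI call that reads one cropped glyph image."""
    return ["tesseract", image_path, "stdout", "--psm", "10", "--oem", "1"]


# Tiny synthetic example: three "glyphs", with a wide gap before the last one.
rows = [
    "#..##....#.#",
    "#..#.....###",
]
bitmap = [[1 if c == "#" else 0 for c in row] for row in rows]
layout = mark_spaces(column_spans(bitmap), space_gap=3)
```

Because the white space is decided from the pixel gaps rather than by Tesseract, this kind of pre-segmentation would also sidestep Issue #2, at the cost of having to tune `space_gap` for each font size.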

