Re: [tesseract-ocr] Force Tesseract to do individual character OCR only

Lorenzo Bolzani Wed, 30 Oct 2019 02:45:19 -0700

Hi,
first crop the white border around the text. With the border removed I get
the correct result.


Try this on a large batch of data and see what works best: no border, a
one-pixel border, etc. Also try different text sizes, from 30 to 50, by
upscaling the image.
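
A minimal sketch of the cropping step, assuming a grayscale image with dark
text on a light background, stored as a plain list of rows of 0-255 values.
The function names, the white_threshold value and the list-of-rows
representation are my own choices for illustration, not anything from
Tesseract or OpenCV.

```python
def content_bbox(gray, white_threshold=250):
    """Return (top, bottom, left, right), inclusive, of the non-white pixels.

    A pixel counts as text if it is darker than white_threshold (an assumed
    value, tune it for your screenshots). Returns None for a blank image.
    """
    rows = [y for y, row in enumerate(gray)
            if any(p < white_threshold for p in row)]
    if not rows:
        return None
    cols = [x for x in range(len(gray[0]))
            if any(row[x] < white_threshold for row in gray)]
    return rows[0], rows[-1], cols[0], cols[-1]


def crop_with_border(gray, border=0, white=255, white_threshold=250):
    """Crop to the text and add back `border` white pixels on every side."""
    box = content_bbox(gray, white_threshold)
    if box is None:
        return [row[:] for row in gray]  # nothing to crop
    top, bottom, left, right = box
    cropped = [row[left:right + 1] for row in gray[top:bottom + 1]]
    width = len(cropped[0]) + 2 * border
    pad = [[white] * width for _ in range(border)]
    body = [[white] * border + row + [white] * border for row in cropped]
    return pad + body + [row[:] for row in pad]


def scale_for_text_height(current_height, target_height):
    """Factor by which to upscale so the text reaches target_height."""
    return target_height / current_height
```

To compare variants on a batch, run crop_with_border with border=0, 1, 2 and
so on, upscale each result by scale_for_text_height(h, target) for targets
between 30 and 50, and check which combination gives the fewest errors.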

If this does not help have a look here:

https://github.com/tesseract-ocr/tesseract/pull/2635
https://github.com/tesseract-ocr/tesseract/blob/84c410a8e30bd0ae589871b985f62e708a702fb1/src/ccmain/tesseractclass.h#L1078

I have not tried these myself yet, and I do not know the exact situation for
the 5.x version you are using, but they could be worth a try.

I would try lstm_choice_mode 1 and 2, and lstm_choice_iterations at a few
values above and below 5 (above is probably better).
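
A sketch of how such a sweep could be scripted. It only builds the command
lines and does not run them. The --psm 6 --oem 1 flags are taken from your
setup, the -c name=value syntax is the usual way to set a parameter on the
command line, and the trailing hocr config is my assumption about where the
alternative choices show up. The function name, the output naming scheme and
the range 3 to 8 are made up for illustration, and as said above I have not
checked that the two parameters behave this way in your 5.x build.

```python
import itertools


def build_sweep_commands(image_path, output_prefix,
                         modes=(1, 2), iterations=(3, 4, 5, 6, 7, 8)):
    """Return one tesseract argument list per (mode, iterations) pair."""
    commands = []
    for mode, iters in itertools.product(modes, iterations):
        # Hypothetical naming scheme so each run writes its own output file.
        out_base = f"{output_prefix}_mode{mode}_iter{iters}"
        commands.append([
            "tesseract", image_path, out_base,
            "--psm", "6", "--oem", "1",
            "-c", f"lstm_choice_mode={mode}",
            "-c", f"lstm_choice_iterations={iters}",
            "hocr",  # assumption: inspect the alternatives in hOCR output
        ])
    return commands
```

Each list can be passed to subprocess.run(cmd, check=True) once tesseract is
on the PATH, and the resulting files compared against the expected text.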


I think a Tesseract mode where characters are interpreted out of context is
not possible: the neural network uses the context to recognize the
characters, and this is not something you can switch off. The solution would
be a different model, trained or fine-tuned on randomly mixed text rather
than on real words.
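
To give an idea of what "randomly mixed text" could look like, here is a
small sketch that generates lines of random letters and digits. The function
name, the token lengths and the choice of alphabet are all assumptions; a
real training text should match the characters and spacing that appear in
your screenshots, and the training procedure itself is not covered here.

```python
import random
import string


def random_training_lines(n_lines, min_tokens=1, max_tokens=5,
                          max_token_len=8, seed=0):
    """Generate lines of space-separated random alphanumeric tokens."""
    rng = random.Random(seed)  # seeded so the text is reproducible
    alphabet = string.ascii_letters + string.digits
    lines = []
    for _ in range(n_lines):
        n_tokens = rng.randint(min_tokens, max_tokens)
        tokens = [
            "".join(rng.choice(alphabet)
                    for _ in range(rng.randint(1, max_token_len)))
            for _ in range(n_tokens)
        ]
        lines.append(" ".join(tokens))
    return lines
```

Because letters and digits are drawn independently, a digit is as likely to
follow a letter as another digit, which is the pattern the stock models
handle badly.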



Lorenzo



On Tue, 29 Oct 2019 at 05:12, Dave Wood <
[email protected]> wrote:

> I am trying to use Tesseract to OCR screen shots from various Windows
> applications.  So essentially the data is a random collection of letters
> and numbers, not written words/sentences like it was primarily oriented to
> handle.
>
> Here is my setup:
>
> -Tesseract Windows Version 5.0.0 from UB-Mannheim
> -image cleaning and resizing using OpenCV (I have put much effort into
> getting this as good as I can)
> -parameters --psm 6 --oem 1 (have also tried oem 0 and 3 with pretty much
> same results)
> -config file contents
>      language_model_penalty_non_dict_word 0.0
>      language_model_penalty_chartype 0.0
>      language_model_penalty_case 0.0
>      language_model_penalty_non_freq_dict_word 0.0
>
> Tesseract is performing reasonably well for my needs, but I have a couple
> of problems that I can't resolve.  They seem to be related to Tesseract
> functionality which tries to decide what a given character is not just
> based on its pixel layout, but also based on the context that the character
> occurs in.
>
> *Issue #1*
>
> Occasionally Tesseract inserts extra characters in its output, seemingly
> when it is unsure how to choose between a couple of different alternatives:
>
> [image: OneOfThree.png]
> For the above image, Tesseract produces the following output:
>
> 10of3
>
> As you can see, Tesseract inserts the digit 0 in front of the lower case
> letter o in the output.  It also ignores the white space in the image.
>
> Others have reported this issue, for example the thread below:
>
> https://github.com/tesseract-ocr/tesseract/issues/1465
>
> *Issue #2*
>
> As shown in the above example, Tesseract sometimes ignores white space
> which at least to my eye is big enough not to be missed.
>
> *Issue #3*
>
> Tesseract has a hard time dealing with random strings of alpha characters
> and digits mixed together in no particular order.  It has a tendency to
> output a digit when the previous character was a digit, and an alpha when
> the previous character was an alpha.
>
> Others have reported this issue, for example the thread below:
>
> https://github.com/tesseract-ocr/tesseract/issues/733
>
>
> *Suggestion:*
>
> At least for my situation, it seems that the best thing would be if there
> were a definitive Tesseract option to interpret individual characters
> without reference to their context.  Since my data comes from screen shots,
> it is very clear and very consistent, and I would think that a
> character-by-character mode would work well.
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/a434e930-a53e-44e0-bfd7-a46385ea091a%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/a434e930-a53e-44e0-bfd7-a46385ea091a%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
