[tesseract-ocr] Re: Force Tesseract to do individual character OCR only

Dave Wood Tue, 29 Oct 2019 15:38:45 -0700

Issue #2 in my original post (quoted below) is the case where Tesseract 
does not separate items which, to my eye at least, are far enough apart to 
be considered separate.  I have captured the full list of Tesseract 
configuration parameters, and many of them deal with spacing.  However, 
there are too many for me to work out which ones might be relevant to the 
example I provided.  Could anybody suggest which of them I should look at?
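
One way to narrow the parameter list down is to filter the dump by keyword. The sketch below assumes the `tesseract` executable is on the PATH and that `--print-parameters` writes one parameter per line; the keywords in `SPACING_HINTS` are my own guesses, not an official classification, and a match does not mean the parameter has any effect with a given `--oem` setting.

```python
import subprocess

# Keyword guesses for "spacing-related"; adjust as needed (assumption, not an official list).
SPACING_HINTS = ("space", "gap", "kern")


def dump_parameters(tesseract="tesseract"):
    """Return the raw text printed by `tesseract --print-parameters`."""
    result = subprocess.run(
        [tesseract, "--print-parameters"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def spacing_related(dump_text, hints=SPACING_HINTS):
    """Keep only the lines of the dump that mention one of the hint keywords."""
    return [
        line
        for line in dump_text.splitlines()
        if any(hint in line.lower() for hint in hints)
    ]


if __name__ == "__main__":
    for line in spacing_related(dump_parameters()):
        print(line)
```

Matching against the whole line (name and description together) avoids depending on the exact column layout of the dump.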


Thanks.

On Monday, October 28, 2019 at 9:12:34 PM UTC-7, Dave Wood wrote:
>
> I am trying to use Tesseract to OCR screenshots from various Windows 
> applications.  So essentially the data is a random collection of letters 
> and numbers, not the written words and sentences that Tesseract is 
> primarily designed to handle.
>
> Here is my setup:
>
> -Tesseract Windows Version 5.0.0 from UB-Mannheim
> -image cleaning and resizing using OpenCV (I have put much effort into 
> getting this as good as I can)
> -parameters --psm 6 --oem 1 (I have also tried --oem 0 and --oem 3 with 
> pretty much the same results)
> -config file contents
>      language_model_penalty_non_dict_word 0.0
>      language_model_penalty_chartype 0.0
>      language_model_penalty_case 0.0
>      language_model_penalty_non_freq_dict_word 0.0
>
> Tesseract is performing reasonably well for my needs, but I have a couple 
> of problems that I can't resolve.  They seem to be related to the way 
> Tesseract decides what a given character is: it relies not just on the 
> character's pixel layout, but also on the context in which the character 
> occurs.
>
> *Issue #1*
>
> Occasionally Tesseract inserts extra characters in its output, seemingly 
> when it is unsure how to choose between a couple of different alternatives:
>
> [image: OneOfThree.png]
> For the above image, Tesseract produces the following output:
>
> 10of3
>
> As you can see, Tesseract inserts the digit 0 in front of the lower case 
> letter o in the output.  It also ignores the white space in the image.
>
> Others have reported this issue, for example the thread below:
>
> https://github.com/tesseract-ocr/tesseract/issues/1465
>
> *Issue #2*
>
> As shown in the above example, Tesseract sometimes ignores white space 
> which at least to my eye is big enough not to be missed.
>
> *Issue #3*
>
> Tesseract has a hard time dealing with random strings of alpha characters 
> and digits mixed together in no particular order.  It has a tendency to 
> output a digit when the previous character was a digit, and an alpha when 
> the previous character was an alpha.
>
> Others have reported this issue, for example the thread below:
>
> https://github.com/tesseract-ocr/tesseract/issues/733
>
>
> *Suggestion:*
>
> At least for my situation, it seems that the best thing would be if there 
> were a definitive Tesseract option to interpret individual characters 
> without reference to their context.  Since my data comes from screen shots, 
> it is very clear and very consistent, and I would think that a 
> character-by-character mode would work well.
>
>
>
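
To make the character-by-character idea concrete, here is a minimal sketch of doing the segmentation outside Tesseract. It assumes a single text line that has already been binarized into rows of 0 (background) and 1 (ink), and that glyphs never touch horizontally; the function names and the `min_gap` threshold are hypothetical and would need tuning per font. Each cropped glyph could then be passed to Tesseract with `--psm 10` (single-character mode), and the measured gaps used to decide where spaces belong.

```python
def column_runs(binary_rows):
    """Return (start, end) column spans, end exclusive, that contain ink.

    binary_rows: list of equal-length rows, 1 = ink, 0 = background.
    """
    width = len(binary_rows[0]) if binary_rows else 0
    has_ink = [any(row[x] for row in binary_rows) for x in range(width)]
    runs, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                 # a glyph begins at this column
        elif not ink and start is not None:
            runs.append((start, x))   # the glyph ended at the previous column
            start = None
    if start is not None:
        runs.append((start, width))   # glyph touches the right edge
    return runs


def space_positions(runs, min_gap):
    """Indices i where the gap between glyph i and glyph i + 1 is at least min_gap."""
    return [
        i
        for i in range(len(runs) - 1)
        if runs[i + 1][0] - runs[i][1] >= min_gap
    ]


def single_char_command(glyph_path, tesseract="tesseract"):
    """Command line for OCR of one cropped glyph image, printing to stdout."""
    return [tesseract, glyph_path, "stdout", "--psm", "10"]
```

For example, with glyph spans `[(0, 1), (2, 4), (7, 8)]` and `min_gap=3`, only the gap between the second and third glyph is treated as a space. This sidesteps the spacing decision inside Tesseract, at the cost of breaking on touching or kerned glyphs.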

