Re: [tesseract-ocr] Re: Force Tesseract to do individual character OCR only

Lorenzo Bolzani Thu, 31 Oct 2019 04:21:17 -0700

Hi Dave,
are you sure the parameters are being used? For example setting
lstm_choice_mode to an invalid number or lstm_choice_iterations to zero
should at least produce some errors. With lstm_choice_mode > 0 you should
get the extra matches in the HOCR.
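
One way to rule out a config file that is silently ignored is to pass the parameters directly on the command line with -c. The sketch below only builds and runs such a call. The helper names, the file names and the chosen values are illustrative assumptions, and the --psm/--oem settings are copied from the setup quoted further down.

```python
import subprocess


def build_tesseract_command(image_path, output_base,
                            choice_mode=2, choice_iterations=5):
    """Build a tesseract CLI call that requests hOCR output.

    The values are passed through unchanged on purpose, so that an
    invalid value can be used to check whether the parameter is read.
    """
    return [
        "tesseract", image_path, output_base,
        "--psm", "6",
        "--oem", "1",
        "-c", f"lstm_choice_mode={choice_mode}",
        "-c", f"lstm_choice_iterations={choice_iterations}",
        "hocr",  # config name that switches the output to hOCR
    ]


def run_tesseract(image_path, output_base, **kwargs):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    command = build_tesseract_command(image_path, output_base, **kwargs)
    return subprocess.run(command, check=True, capture_output=True, text=True)
```

Calling run_tesseract("OneOfThree.png", "out") would be expected to write out.hocr next to the script, assuming the tesseract binary is on the PATH.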


About the boxes: these come from a problem in decoding the neural network
output. Sometimes the overlap is large; other times an individual character
is fragmented into small parts. These cases are not easy to detect, and it
is not easy to decide which character to keep either (the confidence is not
completely reliable, because the decoding itself had problems).

You can see some more examples here in this issue I created some time ago
(for 4.0):

https://github.com/tesseract-ocr/tesseract/issues/1778

What you, or tesseract, could do is analyse the boxes and fix these
problems. This is what I do with ugly custom code; I can fix most of
these problems, even if I sometimes introduce some new errors too.
But those boxes are what tesseract has just produced, so processing them
again does not make much sense: it should simply generate them correctly in
the first place (as they tried to do in 4.1). This, I think, is why there is
no option to process those boxes again. (The double letters are not
alternatives, they are distinct letters. You can get alternative choices
with lstm_choice_mode, but that is a different thing.)
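
To give an idea of what such box post-processing can look like, here is a minimal sketch. It is not my actual code: the CharBox type, the 0.8 threshold and the "keep the most confident box" rule are all assumptions for illustration, and it assumes a single text line where characters can be ordered by their left edge. As said above, the confidence is not always reliable, so a rule like this can introduce new errors.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    """Hypothetical per-character result: text, pixel box, confidence."""
    text: str
    left: int
    top: int
    right: int
    bottom: int
    conf: float


def area(box: CharBox) -> int:
    return max(0, box.right - box.left) * max(0, box.bottom - box.top)


def overlap_ratio(a: CharBox, b: CharBox) -> float:
    """Intersection area divided by the area of the smaller box."""
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.bottom, b.bottom) - max(a.top, b.top)
    if width <= 0 or height <= 0:
        return 0.0
    smaller = min(area(a), area(b))
    return (width * height) / smaller if smaller > 0 else 0.0


def drop_overlapping(boxes: List[CharBox], threshold: float = 0.8) -> List[CharBox]:
    """Keep the most confident box among heavily overlapping ones."""
    kept: List[CharBox] = []
    # Visit boxes from most to least confident, so that a box is only
    # dropped in favour of one with equal or higher confidence.
    for box in sorted(boxes, key=lambda b: b.conf, reverse=True):
        if all(overlap_ratio(box, other) < threshold for other in kept):
            kept.append(box)
    # Restore reading order (single-line assumption).
    return sorted(kept, key=lambda b: b.left)


if __name__ == "__main__":
    # Made-up boxes imitating the "10of3" case from the quoted message.
    line = [
        CharBox("1", 0, 0, 8, 12, 95.0),
        CharBox("0", 20, 0, 30, 12, 60.0),
        CharBox("o", 21, 0, 31, 12, 90.0),
        CharBox("f", 33, 0, 40, 12, 92.0),
        CharBox("3", 50, 0, 58, 12, 94.0),
    ]
    print("".join(b.text for b in drop_overlapping(line)))
```

The ratio uses the smaller box as the denominator so that a small fragment sitting entirely inside a larger box also counts as a full overlap.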

Did you try version 4.1? 5.x is not released yet (but I do not expect big
differences). Did you try cropping the border?


Lorenzo


On Wed, 30 Oct 2019 at 19:32, Dave Wood <
wood.john.da...@gmail.com> wrote:

> Thanks for the response Lorenzo.
>
> I did try your suggestion about lstm_choice parameters, trying many
> combinations, but that didn't make any difference.  What would make sense
> to me would be if there was an option to tell Tesseract not to output
> multiple characters where the box overlaps significantly, like in the case
> of this small example.  If you look at the HOCR output from this image, you
> can see that the boxes for both the 'o' and the '0' overlap by 90%.
> Tesseract should have an option to tell it to output only the highest
> confidence level character as opposed to both choices.
>
> On Monday, October 28, 2019 at 9:12:34 PM UTC-7, Dave Wood wrote:
>>
>> I am trying to use Tesseract to OCR screen shots from various Windows
>> applications.  So essentially the data is a random collection of letters
>> and numbers, not the written words/sentences that Tesseract is primarily
>> designed to handle.
>>
>> Here is my setup:
>>
>> -Tesseract Windows Version 5.0.0 from UB-Mannheim
>> -image cleaning and resizing using OpenCV (have put much effort into
>> getting this as good as I can)
>> -parameters --psm 6 --oem 1 (have also tried oem 0 and 3 with pretty much
>> the same results)
>> -config file contents
>>      language_model_penalty_non_dict_word 0.0
>>      language_model_penalty_chartype 0.0
>>      language_model_penalty_case 0.0
>>      language_model_penalty_non_freq_dict_word 0.0
>>
>> Tesseract is performing reasonably well for my needs, but I have a couple
>> of problems that I can't resolve.  They seem to be related to Tesseract
>> functionality which tries to decide what a given character is not just
>> based on its pixel layout, but also based on the context that the character
>> occurs in.
>>
>> *Issue #1*
>>
>> Occasionally Tesseract inserts extra characters in its output, seemingly
>> when it is unsure how to choose between a couple of different alternatives:
>>
>> [image: OneOfThree.png]
>> For the above image, Tesseract produces the following output:
>>
>> 10of3
>>
>> As you can see, Tesseract inserts the digit 0 in front of the lower case
>> letter o in the output.  It also ignores the white space in the image.
>>
>> Others have reported this issue, for example the thread below:
>>
>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>
>> *Issue #2*
>>
>> As shown in the above example, Tesseract sometimes ignores white space
>> which at least to my eye is big enough not to be missed.
>>
>> *Issue #3*
>>
>> Tesseract has a hard time dealing with random strings of alpha characters
>> and digits mixed together in no particular order.  It has a tendency to
>> output a digit when the previous character was a digit, and an alpha when
>> the previous character was an alpha.
>>
>> Others have reported this issue, for example the thread below:
>>
>> https://github.com/tesseract-ocr/tesseract/issues/733
>>
>>
>> *Suggestion:*
>>
>> At least for my situation, it seems that the best thing would be if there
>> were a definitive Tesseract option to interpret individual characters
>> without reference to their context.  Since my data comes from screen shots,
>> it is very clear and very consistent, and I would think that a
>> character-by-character mode would work well.
>>
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/6eee7de8-b364-4fa2-afbe-b8992b4ed050%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/6eee7de8-b364-4fa2-afbe-b8992b4ed050%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

