Re: [tesseract-ocr] Dropped characters from perfect image

'John Taves' via tesseract-ocr Wed, 09 Mar 2016 09:45:46 -0800

I am using the C# API with whatever page segmentation mode it applies by 
default. Which Tesseract variable[1] should I adjust?


[1] http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version

jt

On Wednesday, March 9, 2016 at 8:44:02 AM UTC-8, zdenop wrote:
>
> What page segmentation method[1] did you use?
>
> [1] 
> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method
>
> Zdenko
>
> On Wed, Mar 9, 2016 at 5:14 PM, 'John Taves' via tesseract-ocr 
> <[email protected]> wrote:
>
>> I am trying to recognize a flawless image. I rendered the image from a PDF 
>> that is entirely vector content, with no embedded raster images, so it has 
>> no noise, no skew, and flawless characters at any DPI I choose.
>>
>>
>> Tesseract's recognition is poor. The main problem is dropped characters: 
>> it seems to randomly ignore perfectly good-looking characters.
>>
>>
>> The screenshot shows the text results in the upper left and the image in 
>> the background (only the upper left of the image is visible). The bounding 
>> boxes of the results are drawn in red on that image. Notice all the missing 
>> characters. In this particular image, all the characters to the right of 
>> the visible area are found and recognized properly. The image consists of a 
>> table of information (rows of item #, size, description, and qty). The 
>> columns are not neatly aligned (although this example is pretty good). Some 
>> rows are separated by a line (in this example every row has one; notice 
>> that Tesseract gives me a bounding box for some of the lines, but not 
>> all). I tried removing the lines, but that just changed which characters 
>> were dropped, with no rhyme or reason to it. Other images from this same 
>> set are very similar, but Tesseract drops characters on the right, or 
>> whole lines are missing. I have tried DPI values from 75 to 300, but the 
>> results were just as disappointing.
>>
>>
>> Can anyone suggest how this might be solved?
>>
>>
>> <https://lh3.googleusercontent.com/-YwT5YW2wYGo/VuBLmZ-_lSI/AAAAAAAAAZ8/FhfW1gGg_8g/s1600/BadOCR.png>
>>
>>
>> <https://lh3.googleusercontent.com/-ER5AgyxXtY4/VuBLtP6wWvI/AAAAAAAAAaA/1Lxb767Xiqs/s1600/foo700219.png>
>>
>>
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/8c27aca6-3a45-4c23-97af-676fc6b0b611%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

