I am trying to run OCR on a flawless image. I rendered the image from a PDF 
that is entirely vector, not raster. It has no noise, no skew, and clean 
characters at any DPI I want.


The recognition from Tesseract is terrible. The main problem is dropped 
characters: it seems to ignore perfectly good-looking characters at random.
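

One way to put a number on the dropped characters is to diff the OCR output 
against text that is known to be correct for a row. The following is a 
minimal sketch, assuming such reference text is available (typed by hand, for 
instance); the function name and the sample strings are hypothetical, and it 
counts only pure deletions, not misrecognized characters.

```python
import difflib
from typing import List, Tuple


def dropped_characters(expected: str, recognized: str) -> List[Tuple[int, str]]:
    """Return (offset, text) for each run of expected characters that is
    entirely absent from the OCR output.

    Substitutions (difflib 'replace' opcodes) are ignored on purpose, so this
    reports dropped characters only, not misread ones.
    """
    matcher = difflib.SequenceMatcher(None, expected, recognized, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            dropped.append((i1, expected[i1:i2]))
    return dropped


if __name__ == "__main__":
    # Hypothetical table row and a hypothetical OCR result for it.
    reference = "ITEM 1042 3/4 IN BOLT QTY 12"
    ocr_text = "ITEM 102 3/4 IN BOLT QTY 1"
    for offset, text in dropped_characters(reference, ocr_text):
        print(f"dropped {text!r} at offset {offset}")
```

Running this per row at each setting makes it easier to see whether a change 
actually reduces the drops or only moves them around.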


The screenshot shows the text results in the upper left and the image in 
the background (only the upper left of the image is visible). The bounding 
boxes of the results are shown in red on that image. Notice all the missing 
characters. On this particular image, all the characters to the right of 
what you can see are found and recognized properly.


The image consists of a table of information (rows of item #, size, 
description, and qty). The columns are not neatly aligned (although this 
example is pretty good). Some rows are separated by a line; this example has 
a line for each row, and notice that Tesseract gives me a bounding box for 
some of the lines, but not all. I tried removing the lines, but that just 
changed which characters were dropped, with no apparent pattern.


Other images from this same set are very similar, but Tesseract will drop 
characters on the right, or whole lines will be missing. I have tried DPI 
values from 75 to 300, but the results were just as disappointing.
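

A DPI sweep of this kind can be scripted roughly as follows. This is a 
minimal sketch under stated assumptions, not a description of the exact setup 
used here: it assumes poppler's pdftoppm for rendering and the tesseract 
command-line tool, the file names are hypothetical, and the optional page 
segmentation mode is passed as `--psm` (older 3.x releases spell it `-psm`). 
The `hocr` config is included so each run also writes bounding boxes.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional, Sequence


def render_cmd(pdf: str, out_prefix: str, dpi: int) -> List[str]:
    """Command to render the first page of a PDF to <out_prefix>.png at the given DPI."""
    # -singlefile makes pdftoppm write exactly <out_prefix>.png (no page suffix).
    return ["pdftoppm", "-r", str(dpi), "-png", "-singlefile", pdf, out_prefix]


def ocr_cmd(image: str, out_base: str, psm: Optional[int] = None) -> List[str]:
    """Command to OCR an image and write hOCR output (text plus bounding boxes)."""
    cmd = ["tesseract", image, out_base]
    if psm is not None:
        # Assumption: a Tesseract build that accepts --psm; 3.04 and earlier use -psm.
        cmd += ["--psm", str(psm)]
    cmd.append("hocr")  # config file name goes after the options
    return cmd


def sweep(pdf: str, work_dir: str, dpis: Sequence[int] = (75, 150, 300),
          psm: Optional[int] = None) -> List[Path]:
    """Render and OCR the page once per DPI; return the expected hOCR paths."""
    if not (shutil.which("pdftoppm") and shutil.which("tesseract")):
        raise RuntimeError("pdftoppm and tesseract must both be on PATH")
    out_dir = Path(work_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for dpi in dpis:
        prefix = str(out_dir / f"page_{dpi}")
        subprocess.run(render_cmd(pdf, prefix, dpi), check=True)
        subprocess.run(ocr_cmd(prefix + ".png", prefix, psm), check=True)
        # Recent releases write <prefix>.hocr; very old ones used .html.
        results.append(Path(prefix + ".hocr"))
    return results
```

Calling `sweep("table.pdf", "out")` (hypothetical names) leaves one image and 
one hOCR file per DPI in the output directory, so the runs can be compared 
side by side.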


Can anyone suggest how this might be solved?

<https://lh3.googleusercontent.com/-YwT5YW2wYGo/VuBLmZ-_lSI/AAAAAAAAAZ8/FhfW1gGg_8g/s1600/BadOCR.png>

<https://lh3.googleusercontent.com/-ER5AgyxXtY4/VuBLtP6wWvI/AAAAAAAAAaA/1Lxb767Xiqs/s1600/foo700219.png>


