Which page segmentation method [1] did you use?

[1] https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method
Zdenko

On Wed, Mar 9, 2016 at 5:14 PM, 'John Taves' via tesseract-ocr <[email protected]> wrote:

> I am trying to recognize a flawless image. I created the image from a pdf
> that is all vector, not image. It has no noise, no skew, flawless
> characters in any DPI that I want.
>
> The recognition from Tesseract sucks. Generally the problem is dropped
> characters. It seems to randomly ignore perfectly good looking characters.
>
> The screen shot shows the text results in the upper left and the image in
> the background (only the upper left of the image is visible). The bounding
> boxes of the results are shown in red on that image. Notice all the missing
> characters. On this particular image all the characters to the right of
> what you can see are found and recognized properly. The image consists of a
> table of information (rows of item #, size, description, and qty). The
> columns are not nicely aligned (although this example is pretty good). Some
> rows are separated by a line (this example has a line for each row, and
> notice that tesseract gives me a bounding box for some of the lines, but
> not all). I tried removing the lines, but that just changed the set of
> dropped characters with no rhyme or reason to it. Other images from this
> same set are very similar but tesseract will drop characters on the right,
> or whole lines will be missing. I have tried different DPI from 75 to 300,
> but the results were just as disappointing.
>
> Can anyone suggest how this might be solved?
>
> <https://lh3.googleusercontent.com/-YwT5YW2wYGo/VuBLmZ-_lSI/AAAAAAAAAZ8/FhfW1gGg_8g/s1600/BadOCR.png>
>
> <https://lh3.googleusercontent.com/-ER5AgyxXtY4/VuBLtP6wWvI/AAAAAAAAAaA/1Lxb767Xiqs/s1600/foo700219.png>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/8c27aca6-3a45-4c23-97af-676fc6b0b611%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/8c27aca6-3a45-4c23-97af-676fc6b0b611%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
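For anyone wanting to check the page segmentation question quickly, here is a minimal sketch (not something posted in the thread) that runs the tesseract command-line tool on one image under several page segmentation modes so the outputs can be compared side by side. It assumes tesseract is on the PATH. The file name foo700219.png is only a placeholder taken from the attachment name above, the helper names are made up for illustration, and the choice of modes to try is a guess for a table-like page. Tesseract 3.x releases take the option as -psm, while 4.x and later use --psm, so the sketch lets you pick.

```python
import shutil
import subprocess

# Modes that seem worth comparing for a table-like page.
# Descriptions paraphrase the list printed by the tesseract CLI help.
PSM_CANDIDATES = {
    3: "fully automatic page segmentation (the default)",
    4: "single column of text of variable sizes",
    6: "single uniform block of text",
    11: "sparse text",
}


def build_command(image_path, psm, legacy_flag=False):
    """Return the tesseract command line for one page segmentation mode.

    legacy_flag=True uses the 3.x spelling (-psm) instead of --psm.
    The output base "stdout" makes tesseract print the text instead of
    writing a .txt file.
    """
    flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image_path, "stdout", flag, str(psm)]


def compare_modes(image_path, legacy_flag=False):
    """Run tesseract once per candidate mode and collect the recognized text."""
    results = {}
    for psm in PSM_CANDIDATES:
        proc = subprocess.run(
            build_command(image_path, psm, legacy_flag),
            capture_output=True,
            text=True,
        )
        results[psm] = proc.stdout
    return results


if __name__ == "__main__":
    if shutil.which("tesseract") is None:
        print("tesseract not found on PATH")
    else:
        # Placeholder file name; substitute your own image.
        for psm, text in compare_modes("foo700219.png").items():
            print("=== psm %d: %s ===" % (psm, PSM_CANDIDATES[psm]))
            print(text)
```

If one mode stops dropping characters on the table rows, that is a strong hint the default layout analysis is the culprit and not the image quality.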

