Thank you. That seems to have fixed the dropped characters problem.
jt

On Wednesday, March 9, 2016 at 10:20:27 AM UTC-8, zdenop wrote:
>
> SetPageSegMode and try PSM_SINGLE_BLOCK.
>
> See:
> https://github.com/tesseract-ocr/tesseract/wiki/APIExample#orientation-and-script-detection-osd-example
> https://github.com/tesseract-ocr/tesseract/blob/master/ccstruct/publictypes.h#L151
>
> Zdenko
>
> On Wed, Mar 9, 2016 at 6:45 PM, 'John Taves' via tesseract-ocr <[email protected]> wrote:
>
>> I am using the c# API and whatever default page segmentation happens.
>> What tess variable[1] should I play with?
>>
>> [1]http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
>>
>> jt
>>
>> On Wednesday, March 9, 2016 at 8:44:02 AM UTC-8, zdenop wrote:
>>>
>>> What page segmentation method[1] you used?
>>>
>>> [1] https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method
>>>
>>> Zdenko
>>>
>>> On Wed, Mar 9, 2016 at 5:14 PM, 'John Taves' via tesseract-ocr <[email protected]> wrote:
>>>
>>>> I am trying to recognize a flawless image. I created the image from a
>>>> pdf that is all vector, not image. It has no noise, no skew, flawless
>>>> characters in any DPI that I want.
>>>>
>>>> The recognition from Tesseract sucks. Generally the problem is dropped
>>>> characters. It seems to randomly ignore perfectly good looking characters.
>>>>
>>>> The screen shot shows the text results in the upper left and the image
>>>> in the background (only the upper left of the image is visible). The
>>>> bounding boxes of the results are shown in red on that image. Notice all
>>>> the missing characters. On this particular image all the characters to the
>>>> right of what you can see are found and recognized properly. The image
>>>> consists of a table of information (rows of item #, size, description, and
>>>> qty). The columns are not nicely aligned (although this example is pretty
>>>> good). Some rows are separated by a line (this example has a line for each
>>>> row, and notice that tesseract gives me a bounding box for some of the
>>>> lines, but not all). I tried removing the lines, but that just changed the
>>>> set of dropped characters with no rhyme or reason to it. Other images from
>>>> this same set are very similar but tesseract will drop characters on the
>>>> right, or whole lines will be missing. I have tried different DPI from 75
>>>> to 300, but the results were just as disappointing.
>>>>
>>>> Can anyone suggest how this might be solved?
>>>>
>>>> <https://lh3.googleusercontent.com/-YwT5YW2wYGo/VuBLmZ-_lSI/AAAAAAAAAZ8/FhfW1gGg_8g/s1600/BadOCR.png>
>>>>
>>>> <https://lh3.googleusercontent.com/-ER5AgyxXtY4/VuBLtP6wWvI/AAAAAAAAAaA/1Lxb767Xiqs/s1600/foo700219.png>
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/8c27aca6-3a45-4c23-97af-676fc6b0b611%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
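
For anyone finding this thread later, here is a minimal sketch of the fix suggested above (forcing PSM_SINGLE_BLOCK instead of the default page segmentation). It is not the C# wrapper used in the thread: it assumes Python with the pytesseract and Pillow packages as a stand-in, and the names `PSM`, `psm_config` and `ocr_single_block` are made up for illustration. The mode numbers follow the PageSegMode enum in publictypes.h. The `--psm` spelling is the one used by Tesseract 4 and later; the 3.x command line from the time of this thread used `-psm`, so the helper can emit either.

```python
# Page segmentation modes, numbered as in Tesseract's PageSegMode enum
# (publictypes.h). Only a few of the modes are listed here.
PSM = {
    "PSM_AUTO": 3,            # fully automatic page segmentation, no OSD
    "PSM_SINGLE_COLUMN": 4,   # a single column of text of variable sizes
    "PSM_SINGLE_BLOCK": 6,    # a single uniform block of text
    "PSM_SINGLE_LINE": 7,     # a single text line
    "PSM_SPARSE_TEXT": 11,    # find as much text as possible, in no order
}


def psm_config(mode_name, legacy_flag=False):
    """Build the command-line fragment that selects a segmentation mode.

    legacy_flag=True emits the single-dash form used by Tesseract 3.x.
    """
    flag = "-psm" if legacy_flag else "--psm"
    return f"{flag} {PSM[mode_name]}"


def ocr_single_block(image_path):
    """OCR one image while treating the page as a single block of text.

    Assumes pytesseract, Pillow and a Tesseract 4+ binary are installed;
    the imports are deferred so psm_config() works without them.
    """
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(
            img, config=psm_config("PSM_SINGLE_BLOCK")
        )
```

Usage would be something like `text = ocr_single_block("foo700219.png")`. If single block still drops characters on a table-like page, swapping in another entry from `PSM` is a cheap experiment.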

