Call SetPageSegMode and try PSM_SINGLE_BLOCK.

See:
https://github.com/tesseract-ocr/tesseract/wiki/APIExample#orientation-and-script-detection-osd-example
https://github.com/tesseract-ocr/tesseract/blob/master/ccstruct/publictypes.h#L151
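
If you want to compare modes quickly before touching the C# code, something like the following Python sketch can help. It only builds the command line for the tesseract executable. The helper name build_tesseract_command and the PAGE_SEG_MODES table are made up for this illustration, and the table is just a subset of the PageSegMode enum in publictypes.h. It assumes a release that accepts --psm; some older 3.x releases spelled the flag -psm, hence the legacy_flag switch.

```python
import subprocess

# Subset of tesseract::PageSegMode values (see ccstruct/publictypes.h).
# Hypothetical lookup table, for illustration only.
PAGE_SEG_MODES = {
    "PSM_AUTO": 3,           # fully automatic page segmentation (the default)
    "PSM_SINGLE_COLUMN": 4,  # a single column of text of variable sizes
    "PSM_SINGLE_BLOCK": 6,   # a single uniform block of text
    "PSM_SINGLE_LINE": 7,    # a single text line
    "PSM_SPARSE_TEXT": 11,   # find as much text as possible, in no particular order
}


def build_tesseract_command(image_path, output_base,
                            psm_name="PSM_SINGLE_BLOCK", legacy_flag=False):
    """Return the argument list for one tesseract run with the given mode."""
    psm = PAGE_SEG_MODES[psm_name]
    # Assumption: newer releases take --psm, some older 3.x releases take -psm.
    flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image_path, output_base, flag, str(psm)]


def run_all_modes(image_path):
    """Run tesseract once per mode, writing out_<mode>.txt for comparison."""
    for name in PAGE_SEG_MODES:
        cmd = build_tesseract_command(image_path, "out_" + name.lower(), name)
        subprocess.run(cmd, check=True)  # requires tesseract on PATH
```

Calling run_all_modes("page.png") would leave one text file per mode, so you can see which segmentation stops dropping characters on the table image.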

Zdenko

On Wed, Mar 9, 2016 at 6:45 PM, 'John Taves' via tesseract-ocr <
[email protected]> wrote:

> I am using the C# API and whatever page segmentation it applies by
> default. Which tess variable[1] should I play with?
>
> [1]http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
>
> jt
>
> On Wednesday, March 9, 2016 at 8:44:02 AM UTC-8, zdenop wrote:
>>
>> Which page segmentation method[1] did you use?
>>
>> [1]
>> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method
>>
>> Zdenko
>>
>> On Wed, Mar 9, 2016 at 5:14 PM, 'John Taves' via tesseract-ocr <
>> [email protected]> wrote:
>>
>>> I am trying to recognize a flawless image. I created the image from a
>>> pdf that is all vector, not image. It has no noise, no skew, flawless
>>> characters in any DPI that I want.
>>>
>>>
>>> The recognition from Tesseract sucks. Generally the problem is dropped
>>> characters. It seems to randomly ignore perfectly good looking characters.
>>>
>>>
>>> The screen shot shows the text results in the upper left and the image
>>> in the background (only the upper left of the image is visible). The
>>> bounding boxes of the results are shown in red on that image. Notice all
>>> the missing characters. On this particular image all the characters to the
>>> right of what you can see are found and recognized properly. The image
>>> consists of a table of information (rows of item #, size, description, and
>>> qty). The columns are not nicely aligned (although this example is pretty
>>> good). Some rows are separated by a line (this example has a line for each
>>> row, and notice that tesseract gives me a bounding box for some of the
>>> lines, but not all). I tried removing the lines, but that just changed the
>>> set of dropped characters with no rhyme or reason to it. Other images from
>>> this same set are very similar but tesseract will drop characters on the
>>> right, or whole lines will be missing. I have tried different DPI from 75
>>> to 300, but the results were just as disappointing.
>>>
>>>
>>> Can anyone suggest how this might be solved?
>>>
>>>
>>> <https://lh3.googleusercontent.com/-YwT5YW2wYGo/VuBLmZ-_lSI/AAAAAAAAAZ8/FhfW1gGg_8g/s1600/BadOCR.png>
>>>
>>>
>>> <https://lh3.googleusercontent.com/-ER5AgyxXtY4/VuBLtP6wWvI/AAAAAAAAAaA/1Lxb767Xiqs/s1600/foo700219.png>
>>>
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/8c27aca6-3a45-4c23-97af-676fc6b0b611%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/8c27aca6-3a45-4c23-97af-676fc6b0b611%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
