On May 21, 2:04 am, Nick White <[email protected]> wrote:
> Hi Galt,
>
> I've been suffering a very similar problem with some of the text I'm
> training, which has several diacritics above and below glyphs. It
> isn't infrequent to find quite a few lines of garbage which are some
> of the diacritics taking a line, which then causes the following and
> preceding lines to not include said diacritics.
>
> Switching to -psm 6 did help very significantly, but I'm not
> entirely sure why this would make a difference, and diacritics are
> still sometimes associated with the wrong line (though a lot less).
>

Yes, I have seen the same thing.

> How did you fix the problem in your case?

My hack for fixing some misinterpreted high curly quotes
was to lower the troubled ones with GIMP by 10 to 14px
until Tess started parsing the lines correctly. Luckily I only
had to do this about 6 times on 4 different pages for my book.
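
For anyone who would rather script that nudge than do it by hand,
here is a rough sketch of the same idea in plain Python. It is not
what I actually ran (I used GIMP). It assumes the page is already
loaded as rows of grayscale values (0 = ink, 255 = paper), and the
function name lower_region and the box coordinates are made up for
illustration. Whatever was at the destination gets overwritten, so
it only suits marks with clear space below them.

```python
def lower_region(pixels, box, dy, background=255):
    """Return a copy of `pixels` with the rectangle `box` moved down by `dy` rows.

    pixels: list of rows, each a list of grayscale values (assumed layout).
    box: (left, top, right, bottom), with right and bottom exclusive.
    """
    left, top, right, bottom = box
    if bottom + dy > len(pixels):
        raise ValueError("shifted region would fall off the image")

    # Work on a copy so the caller's image is left untouched.
    out = [row[:] for row in pixels]

    # Copy the glyph out first so source and destination may overlap.
    patch = [row[left:right] for row in pixels[top:bottom]]

    # Blank the original position with the paper colour.
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = background

    # Paste the glyph dy rows lower, overwriting what was there.
    for i, patch_row in enumerate(patch):
        out[top + dy + i][left:right] = patch_row

    return out
```

The 10 to 14px figure would go in as dy; the box has to be found by
eye for each troubled quote, same as with GIMP.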

> Also, can anybody explain
> why -psm 6 makes such a big difference? Does it ensure lines are at
> least a certain height, or is it something else?

My understanding is that -psm 6 tells Tess to expect a block of
text of uniform font and size, which can sometimes help Tess
avoid absurd hypotheses such as: perhaps these left curly quotes are
really the number 66 in a different font/size ;)

But Tess seems to regard -psm 6 more as a suggestion than a hard
rule.
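
In case it helps to compare page segmentation modes across a batch
of pages, here is a minimal sketch of wrapping the command line
from Python. The helper names tesseract_command and run_ocr and the
file names are hypothetical. It assumes the tesseract executable is
on PATH and that the flag is spelled -psm as in the 3.x releases
discussed here; newer releases (4.0 and later) spell it --psm, so
the sketch lets you switch.

```python
import shutil
import subprocess


def tesseract_command(image_path, output_base, psm=6, legacy_flag=True):
    """Build the argument list for one tesseract run.

    legacy_flag=True uses the 3.x spelling "-psm";
    set it to False for the "--psm" spelling of 4.0 and later.
    """
    flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image_path, output_base, flag, str(psm)]


def run_ocr(image_path, output_base, psm=6, legacy_flag=True):
    """Run tesseract and return the name of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        tesseract_command(image_path, output_base, psm, legacy_flag),
        check=True,  # raise if tesseract exits with an error
    )
    # By default tesseract appends ".txt" to the output base name.
    return output_base + ".txt"
```

Running the same page once with the default mode and once with
psm=6, then diffing the two text files, is a quick way to see which
lines the mode actually changes.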

For instance, using -psm 6 does not seem to mean
that the baselines must all be exactly the same distance
from each other. Usually that flexibility is a good thing.
But occasionally Tess imagines things that aren't there.

> Thanks
>
> Nick

