On May 21, 2:04 am, Nick White <[email protected]> wrote:
> Hi Galt,
>
> I've been suffering a very similar problem with some of the text I'm
> training, which has several diacritics above and below glyphs. It
> isn't infrequent to find quite a few lines of garbage which are some
> of the diacritics taking a line, which then causes the following and
> preceding lines to not include said diacritics.
>
> Switching to -psm 6 did help very significantly, but I'm not
> entirely sure why this would make a difference, and diacritics are
> still sometimes associated with the wrong line (though a lot less).
>
Yes, I have seen the same thing.

> How did you fix the problem in your case?

My hack for fixing some misinterpreted high curly quotes was to lower
the troubled ones with GIMP by 10 to 14 px until Tess started parsing
the lines correctly. Luckily I only had to do this about 6 times on 4
different pages for my book.

> Also, can anybody explain
> why -psm 6 makes such a big difference? Does it ensure lines are at
> least a certain height, or is it something else?

My understanding is that -psm 6 tells Tess to expect a block of text of
uniform font and size. That can sometimes help Tess avoid absurd
hypotheses such as: perhaps these left curly quotes are really the
number 66 in a different font/size ;)

But Tess seems to regard -psm 6 more as a suggestion than a hard rule.
For instance, using -psm 6 does not seem to mean that the baselines
must all be exactly the same distance from each other. Usually that
flexibility is a good thing. But occasionally Tess imagines things that
aren't there.

> Thanks
>
> Nick
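
P.S. In case it helps anyone script the -psm 6 runs over a whole book, here is a minimal sketch of how the command line could be assembled from Python. The helper names (build_tesseract_command, run_ocr) and the file names are made up for illustration. It assumes the tesseract binary is on your PATH and that your version takes the single-dash -psm spelling used in this thread; newer releases spell it --psm, so the flag is left as a parameter.

```python
import subprocess


def build_tesseract_command(image_path, output_base, psm=6, psm_flag="-psm"):
    """Build the argument list for one Tesseract run.

    psm=6 asks Tesseract to assume a single uniform block of text.
    psm_flag is "-psm" on older releases and "--psm" on newer ones.
    """
    return ["tesseract", image_path, output_base, psm_flag, str(psm)]


def run_ocr(image_path, output_base, psm=6, psm_flag="-psm"):
    """Run Tesseract and return the recognized text.

    Tesseract writes its result to <output_base>.txt, which is read back here.
    Requires the tesseract binary to be installed.
    """
    cmd = build_tesseract_command(image_path, output_base, psm, psm_flag)
    subprocess.run(cmd, check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    # Hypothetical file names; only prints the command, does not run it.
    print(" ".join(build_tesseract_command("page_017.tif", "page_017")))
```

Keeping the command builder separate from the call that actually runs Tesseract makes it easy to check the arguments, or to retry a troublesome page with a different mode, without re-running OCR on everything.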

