> My (probably naive) impression so far is that the box parser prefers
> to chop a symbol as soon as it finds a match, instead of looking for
> possibly longer matches.

My guess is that this would show up in the unicharset file, if '-' comes
before '-->'.
As far as I know, the unicharset file is built from the training data
and follows the same order. That leaves some room to order the training
data to suit your own needs, e.g. by having the first occurrence of an
arrow come before a line. That would be the long way around, though;
sorry I can't be of more help.
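
To illustrate the ordering effect I am guessing at, here is a toy sketch
in Python. It is not Tesseract's actual code, and I don't know how the
box parser really works; the function names and the symbol list are made
up for the example. It only shows how "take the first symbol that
matches" depends on list order, while "take the longest match" does not.

```python
def first_match(text, symbols):
    """Chop off the first symbol (in list order) that matches at each position."""
    out, i = [], 0
    while i < len(text):
        for s in symbols:
            if text.startswith(s, i):
                out.append(s)
                i += len(s)
                break
        else:
            raise ValueError(f"no symbol matches at position {i}")
    return out


def longest_match(text, symbols):
    """Same loop, but longer symbols are tried first."""
    return first_match(text, sorted(symbols, key=len, reverse=True))


# Hypothetical symbol list with '-' listed before '-->'.
symbols = ["a", "b", "-", ">", "-->"]
print(first_match("a-->b", symbols))    # ['a', '-', '-', '>', 'b']
print(longest_match("a-->b", symbols))  # ['a', '-->', 'b']
```

With '-->' listed before '-', first_match would also return the arrow as
one symbol, which is the reordering idea above.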

> This appears to be the main source of
> inaccuracy for me (otherwise tesseract is great, btw).
> When the letters in a word are being boxed, if the first box is
> incorrectly placed, then the remaining boxes tend to be badly placed
> as well to prevent gaps, at least that's what it looks like from
> examining the boxfiles.

Yes, I have noticed that too. I haven't bothered correcting the boxes
either, beyond occasionally raising the upper border.
Then again, I guess the boxes are not the final result of the training;
the tesseract box.train step and mftraining/cntraining seem to adapt
the box data a bit more.
Am I wrong in thinking that the boxes are just for visualizing, and
that tesseract uses some other kind of structure internally? I'm no
code hacker here ...
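
For the occasional fix to the upper border, something like the following
Python sketch could save some hand editing. It assumes the usual box
file layout of one "symbol left bottom right top" entry per line (newer
versions append a page number), with coordinates measured from the
bottom-left corner of the image, so raising the top edge means
increasing the last coordinate. The function name and the sample line
are made up.

```python
def raise_top(box_line, pixels):
    """Return the box line with its top edge moved up by `pixels`."""
    fields = box_line.split()
    symbol = fields[0]
    left, bottom, right, top = (int(v) for v in fields[1:5])
    rest = fields[5:]  # optional page number, kept unchanged
    new_fields = [symbol, str(left), str(bottom), str(right), str(top + pixels)]
    return " ".join(new_fields + rest)


print(raise_top("T 10 20 30 40", 3))    # T 10 20 30 43
print(raise_top("T 10 20 30 40 0", 3))  # T 10 20 30 43 0
```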
