> My (probably naive) impression so far is that the box parser prefers
> to chop a symbol as soon as it finds a match, instead of looking for
> possibly longer matches.

I would guess it would be evident in the unicharset file, if '-' comes before '-->'. But as far as I know, the unicharset file depends on the training data and has to be in the same order. Then again, that leaves a bit of room for ordering the training data to suit your own needs, i.e. having the first occurrence of an arrow before a line. But this would be the long way around; sorry I can't help you more.

> This appears to be the main source of
> inaccuracy for me (otherwise tesseract is great, btw).
> When the letters in a word are being boxed, if the first box is
> incorrectly placed, then the remaining boxes tend to be badly placed
> as well to prevent gaps, at least that's what it looks like from
> examining the boxfiles.

Yes, I have noticed it too, and haven't bothered correcting them either, beyond occasionally lifting the upper border. Then again, I guess the boxes are not the final result of the training; the tesseract box.train and mftraining/cntraining steps seem to adapt the box data a bit more. Am I wrong in thinking that the boxes are just for visualizing, and that tesseract uses some other kind of structure internally? I'm no code hacker here ...
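
To make the ordering guess concrete, here is a small Python sketch of the difference between "take the first symbol that matches" and "take the longest symbol that matches". It is only an illustration of the idea discussed above, not Tesseract's actual box parser; the function names and the symbol list are made up for the example, and whether Tesseract really scans the unicharset in file order is an assumption.

```python
def first_match(text, symbols):
    """Split text by always taking the first symbol, in list order, that matches."""
    out, i = [], 0
    while i < len(text):
        for sym in symbols:
            if text.startswith(sym, i):
                out.append(sym)
                i += len(sym)
                break
        else:
            # No known symbol starts here: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return out


def longest_match(text, symbols):
    """Same scan, but try longer symbols before shorter ones."""
    by_length = sorted(symbols, key=len, reverse=True)
    return first_match(text, by_length)


# Hypothetical symbol list in which '-' is listed before '-->'.
symbols = ["-", ">", "-->"]
print(first_match("a-->b", symbols))    # the arrow is chopped into '-', '-', '>'
print(longest_match("a-->b", symbols))  # the arrow survives as one symbol
```

If the matching really is order-dependent, putting '-->' ahead of '-' in the list would give the same result as the longest-match version, which is the effect reordering the training data is meant to achieve.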

