On Nov 10, 8:34 pm, kristian k <[EMAIL PROTECTED]> wrote:
> How many is a few?
> For me it sounds that you should train a bit more, maybe with a file
> with mixed arrows and - - > together?

I estimate that I have about 30 instances of "-->", and similar numbers for symbols like "|-->" and "|--" so far. I also have a large number of instances of "-" and ".". I've been using sample pages from my target documents for training, and I can't create arbitrary training files.

My (probably naive) impression so far is that the box parser prefers to chop a symbol as soon as it finds a match, instead of looking for possibly longer matches. This appears to be the main source of inaccuracy for me (otherwise tesseract is great, btw). When the letters in a word are being boxed, if the first box is incorrectly placed, then the remaining boxes tend to be badly placed as well, apparently to prevent gaps. At least that's what it looks like from examining the boxfiles.

The documents I want to read are typewritten and of good quality, so the letters are not connected to each other. Any blob that is connected can be safely assumed to be a single symbol in this case.

> I'm training for a phonetical script, so I have quite many different,
> and longer, signs to deal with. Even though my biggest problem is
> exactly the opposite of yours, 'ga' is almost always recognized as a
> 'ea' with a bow underneath (which is a valid symbols elsewhere in the
> text)
> and also keep on getting the "box overlaps blob in labelled word"
> failure. Don't know what to do with that..

I also get this error in some of my training pages. I am not sure what it means either :(

Thanks.
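
To make the "a connected blob is a single symbol" assumption concrete, here is a minimal sketch of what I mean. It is not how tesseract segments a page; `blob_boxes` and the toy bitmap are hypothetical names and data made up for illustration. It labels 8-connected groups of ink pixels and returns one bounding box per group, so a connected blob is never chopped into several boxes.

```python
from collections import deque


def blob_boxes(bitmap):
    """Return one (left, top, right, bottom) box per 8-connected blob of 1-pixels."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill this blob, growing its bounding box as we go.
            left = right = x
            top = bottom = y
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and bitmap[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    # Sort by left edge so the boxes follow reading order on a single text line.
    return sorted(boxes)


# Toy bitmap (assumed data): two dashes and a ">" drawn as three separate blobs.
ROWS = [
    "...........#..",
    "............#.",
    "###..###.....#",
    "............#.",
    "...........#..",
]
BITMAP = [[1 if ch == "#" else 0 for ch in row] for row in ROWS]

if __name__ == "__main__":
    for box in blob_boxes(BITMAP):
        print(box)
```

Coordinates here are pixel columns and rows with the origin at the top left. In this toy input the typed "-->" comes out as three boxes, one per glyph, and the diagonal strokes of the ">" stay together in one box. The boxes that belong to one multi-glyph symbol would still have to be grouped in a separate step.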

