On Nov 10, 8:34 pm, kristian k <[EMAIL PROTECTED]> wrote:
> How many is a few?
> For me it sounds that you should train a bit more, maybe with a file
> with mixed arrows and - - > together?

I estimate that I have about 30 instances of "-->" so far, and similar
numbers for symbols like "|-->" and "|--". I also have a large number
of instances of "-" and ".". I've been using sample pages from my
target documents for training, and I can't create arbitrary training
files.

My (probably naive) impression so far is that the box parser prefers
to chop off a symbol as soon as it finds a match, instead of looking
for a possibly longer match. This appears to be the main source of
inaccuracy for me (otherwise tesseract is great, btw).
When the letters in a word are being boxed, if the first box is
incorrectly placed, then the remaining boxes tend to be badly placed
as well, to prevent gaps. At least that's what it looks like from
examining the box files.
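
To make the difference I mean concrete, here is a small Python sketch. It
is only an illustration of "chop at the first match" versus "look for the
longest match" on plain strings, not a description of how tesseract
actually segments anything. The symbol list and the `segment` function
are made up for this example.

```python
# Hypothetical symbol inventory, taken from the symbols mentioned in this thread.
SYMBOLS = ["-", ".", "--", "|--", "-->", "|-->"]


def segment(text, symbols, prefer_longest):
    """Split text into known symbols, scanning left to right.

    prefer_longest=False takes the shortest symbol that matches at each
    position (chop as soon as something matches); True takes the longest.
    """
    ordered = sorted(symbols, key=len, reverse=prefer_longest)
    out = []
    i = 0
    while i < len(text):
        match = next((s for s in ordered if text.startswith(s, i)), None)
        if match is None:
            match = text[i]  # unknown character: pass it through unchanged
        out.append(match)
        i += len(match)
    return out


first_match = segment("|-->", SYMBOLS, prefer_longest=False)
longest_match = segment("|-->", SYMBOLS, prefer_longest=True)
print(first_match)    # ['|--', '>']
print(longest_match)  # ['|-->']
```

In the first-match case the early cut after "|--" leaves a stray ">"
behind, which is the kind of knock-on misplacement I think I am seeing.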

The documents I want to read are typewritten and of good quality, so
the letters are not connected. Any connected blob can safely be
assumed to be a single symbol in this case.
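
As a rough sketch of what "one connected blob = one symbol" would look
like, the following Python finds one bounding box per connected group of
ink pixels. Everything here is an assumption for illustration: the page
is a toy list of strings with '#' for ink, pixels touching diagonally
count as connected, coordinates use a top-left origin, and `blob_boxes`
is a made-up helper, not part of tesseract.

```python
from collections import deque


def blob_boxes(rows):
    """Return one bounding box (left, top, right, bottom) per connected blob.

    rows is a list of equal-length strings where '#' marks an ink pixel.
    Pixels touching horizontally, vertically or diagonally share a blob.
    """
    height = len(rows)
    width = len(rows[0]) if rows else 0
    seen = set()
    boxes = []
    for y in range(height):
        for x in range(width):
            if rows[y][x] != "#" or (x, y) in seen:
                continue
            # New blob: flood-fill it and track its extent.
            seen.add((x, y))
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for ny in range(cy - 1, cy + 2):
                    for nx in range(cx - 1, cx + 2):
                        if (0 <= nx < width and 0 <= ny < height
                                and rows[ny][nx] == "#"
                                and (nx, ny) not in seen):
                            seen.add((nx, ny))
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


# Toy page: one wide connected shape followed by a short separate dash.
page = [
    "..........",
    "#.....#...",
    "#######...",
    "#.....#.##",
    "..........",
]
boxes = blob_boxes(page)
print(boxes)  # [(0, 1, 6, 3), (8, 3, 9, 3)]
```

The wide shape comes out as a single box no matter how many smaller
symbols could be matched inside it, which is the behaviour I would want
for my documents.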

> I'm training for a phonetical script, so I have quite many different,
> and longer, signs to deal with. Even though my biggest problem is
> exactly the opposite of yours, 'ga' is almost always recognized as a
> 'ea' with a bow underneath (which is a valid symbol elsewhere in the
> text)
> and also keep on getting the "box overlaps blob in labelled word"
> failure. Don't know what to do with that..

I also get this error in some of my training pages. I am not sure what
it means either :(

Thanks.
