Hello Nick,

After some days I came back here, and I'm very surprised by how many posts you have written. Thanks for answering and taking the time.
I found another bug in the tool. (As I received no answer here, I already posted it to the Issues: http://code.google.com/p/tesseract-ocr/issues/detail?id=1251 )
_____________________________________________________
Apart from the 4 bugs I described in the forum, there is another one:

While the downloaded traineddata files distinguish between punctuation and non-punctuation unichars like:

Punctuation: !"#%&'()*,-./:;?@[\]_{}
Others:      $+<=>|~º®«

the unicharset_extractor tool returns ALL non-alphanumeric characters as punctuation unichars.
_____________________________________________________
I think all the problems that I described can easily be fixed, except the min/max values.

And I still don't understand the basic question: how can we ever write ONE unicharset file with font metrics for a whole bunch of completely different and contradictory fonts? If there were one unicharset file per font, it would be easier. But ONE unicharset file with min/max values for 358 fonts seems completely insane to me!

Did you know that the English and the Spanish traineddata for 3.02 were trained with 358 fonts?
https://groups.google.com/forum/?fromgroups#!topic/tesseract-ocr/boQ188SeFsY

There are fonts that put the "9" below the baseline and others that do not. How do we ever write a unicharset for such different fonts? It simply doesn't make sense to me.
_______________________________________________
Why does Tesseract need these min/max values at all? Wouldn't it be much more intelligent to store this information directly in the feature data, so that each character brings the information about its baseline, height, etc. along with the training data? These values could then be easier to auto-generate.
_______________________________________________
And the other thing that I absolutely don't understand: you are investigating this topic now. But where are the people who know? Is this only Ray? Google is one of the richest companies on earth.
Are they not able to pay one of the people who know the code to write documentation (at least part time)? One of the people who work on the code would need, let's say, a month to write good documentation for Tesseract, which is currently completely abandoned.
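
P.S. To make the punctuation bug easier to reproduce, here is a minimal Python sketch. It assumes that the Unicode general category is a reasonable stand-in for the punctuation flag; I have not confirmed how the shipped unicharset files were actually generated, and `classify` is just a hypothetical helper of mine, not part of Tesseract. It separates punctuation (categories P*) from symbols (categories S*) instead of lumping every non-alphanumeric character together:

```python
import unicodedata

# The two groups as they appear in the downloaded traineddata (see above).
PUNCT = "!\"#%&'()*,-./:;?@[\\]_{}"
OTHERS = "$+<=>|~º®«"


def classify(ch):
    """Coarse class derived from the Unicode general category.

    Hypothetical helper for illustration only; not Tesseract's own logic.
    """
    cat = unicodedata.category(ch)
    if cat.startswith("P"):
        return "punctuation"
    if cat.startswith("S"):
        return "symbol"
    if cat.startswith("L"):
        return "letter"
    if cat.startswith("N"):
        return "number"
    return "other"


if __name__ == "__main__":
    for ch in PUNCT + OTHERS:
        print(repr(ch), unicodedata.category(ch), classify(ch))
```

With this rule every character in the "Punctuation" group comes out as punctuation, and `$+<=>|~®` come out as symbols. It does not reproduce the shipped split exactly, though: Unicode classes « as punctuation (Pi) and º as a letter, so the original files must have been built with some other rule.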

