Hello Nick

After a few days away I came back here, and I'm very surprised by how many 
posts you have written.
Thanks for answering and taking the time.

I found another bug in the tool.
(As I received no answer here, I have already posted it to the issue tracker: 
http://code.google.com/p/tesseract-ocr/issues/detail?id=1251 )
_____________________________________________________

Apart from the 4 bugs I described in the forum there is another one:

While the downloaded traineddata files distinguish between punctuation and 
non-punctuation unichars like:

Punctuation: !"#%&'()*,-./:;?@[\]_{}
Others     : $+<=>|~º®«

the unicharset_extractor tool returns ALL non-alphanumeric characters as 
punctuation unichars.
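
To make the distinction concrete, here is a minimal sketch in Python. It 
assumes that "punctuation" means a Unicode general category starting with 
"P" and that everything else non-alphanumeric counts as "other". That rule 
is my assumption, not something I took from the Tesseract sources, and it 
does not reproduce the downloaded files exactly: Unicode classes "«" as 
initial punctuation, while the files list it among the others.

```python
import unicodedata

def classify(ch):
    """Return 'alnum', 'punct' or 'other' for a single character.

    Assumption: 'punct' means a Unicode general category starting with 'P'.
    """
    if ch.isalnum():
        return "alnum"
    if unicodedata.category(ch).startswith("P"):
        return "punct"
    return "other"

# The ASCII characters from the two lists above.
PUNCT_SAMPLE = "!\"#%&'()*,-./:;?@[\\]_{}"
OTHER_SAMPLE = "$+<=>|~"

for ch in PUNCT_SAMPLE + OTHER_SAMPLE:
    print(ch, classify(ch))
```

A tool that only asks "is it alphanumeric?" puts both samples into the same 
class, which is the behaviour I see from unicharset_extractor.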


_____________________________________________________

I think all the problems that I described can easily be fixed except the 
min/max values.

And I still don't understand the basic question:
How can we ever write ONE unicharset file with font metrics for a whole 
bunch of completely different and contradictory fonts?
If there were one unicharset file per font, it would be easier.
But ONE unicharset file with min/max values for 358 fonts seems completely 
insane to me!
Did you know that the English and Spanish traineddata for 3.02 were 
trained with 358 fonts?
https://groups.google.com/forum/?fromgroups#!topic/tesseract-ocr/boQ188SeFsY

There are fonts that put the "9" below the baseline and others that do not.
How do we ever write a unicharset for such different fonts?
It simply doesn't make sense to me.
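
A small sketch of the problem as I see it. It assumes that a single 
min/max pair has to cover every font, so the only safe choice is the union 
of the per-font ranges. The font names and numbers are made up for 
illustration; they are not taken from any real unicharset.

```python
def merge_ranges(per_font):
    """Union per-font (min, max) ranges into one (min, max) range."""
    lows = [lo for lo, _ in per_font.values()]
    highs = [hi for _, hi in per_font.values()]
    return min(lows), max(highs)

# Hypothetical bottom-edge ranges for "9" in two fonts (made-up numbers).
bottom_of_nine = {
    "font_on_baseline": (60, 66),
    "font_below_baseline": (30, 38),
}

merged = merge_ranges(bottom_of_nine)
print(merged)  # (30, 66)
```

Each font alone has a narrow range, but the merged range is so wide that it 
hardly constrains anything. With 358 fonts this can only get worse.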
_______________________________________________

Why does Tesseract need these min/max values at all?
Wouldn't it be much more intelligent to store this information directly in 
the feature data?
So that each character brings the information about its baseline, height 
etc. along with the training data?
These values could be easier to auto-generate.
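
As a rough sketch of what I mean by auto-generating: the function below 
collects min/max bottom and top values per character from box-file lines. 
It assumes the usual box line layout "<char> <left> <bottom> <right> <top> 
<page>" and works in raw pixel coordinates, so it does not reproduce 
whatever normalization the real unicharset values use. The sample lines 
are invented.

```python
def ranges_from_box_lines(lines):
    """Map each character to (min_bottom, max_bottom, min_top, max_top).

    Assumed line layout: <char> <left> <bottom> <right> <top> <page>
    """
    stats = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        ch = parts[0]
        bottom, top = int(parts[2]), int(parts[4])
        if ch not in stats:
            stats[ch] = (bottom, bottom, top, top)
        else:
            min_b, max_b, min_t, max_t = stats[ch]
            stats[ch] = (min(min_b, bottom), max(max_b, bottom),
                         min(min_t, top), max(max_t, top))
    return stats

# Invented sample: the same "9" from two fonts, plus one "a".
sample = [
    "9 10 20 30 60 0",
    "9 40 5 60 45 0",
    "a 70 20 90 40 0",
]
box_stats = ranges_from_box_lines(sample)
print(box_stats["9"])  # (5, 20, 45, 60)
```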
_______________________________________________

And the other thing that I absolutely don't understand:
You are investigating this topic now.
But where are the people who know ?
Is this only Ray ?

Google is one of the richest companies on earth.
Are they not able to pay one of the people who know to write 
documentation (at least part time)?
One of the people who work on the code would need, let's say, a month to 
write good documentation for Tesseract, which is currently completely 
neglected.

